Method of Identifying Nucleic Acids


Related Applications 

This application claims priority to USSN 60/115,109, filed January 8, 1999, which is
incorporated herein in its entirety. 

Field of the Invention 

The present invention relates to nucleic acids and more particularly to methods of 
equalizing the representation of nucleic acids in a population of nucleic acid molecules. 

Background of the Invention 

Approximately 10,000-20,000 genes are thought to be expressed within living cells, 
depending upon the specific cell type. RNAs corresponding to different genes can be present at
different levels in cells. For example, transcripts from as few as 10-15 genes may represent
10-15% of cellular mRNA by mass. In addition to these highly abundant transcripts, another
1000-2000 genes encode moderately abundant transcripts, which can account for up to 50% of
cellular mRNA mass. Transcripts from the remaining genes fall into the low abundance class.

Because many genes are identified by isolating complementary DNA (cDNA) 
corresponding to an RNA sequence, a significant problem can arise because of differences in the 
levels at which specific RNAs are present in cell types. The most abundant sequences can be 
repeatedly sampled, while the lowest abundance class may be rarely, if ever, sampled. 
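
The sampling problem can be made concrete with a short calculation. The sketch below assumes that clones are drawn independently and at random from the library, so that a transcript making up a fraction f of the mRNA population is missed in N clones with probability (1 - f)^N. The example fractions are illustrative assumptions chosen to be consistent with the abundance classes described above; they are not measured values.

```python
def prob_sampled(fraction: float, n_clones: int) -> float:
    """Probability that a transcript making up `fraction` of the library
    appears at least once among `n_clones` independently drawn clones."""
    return 1.0 - (1.0 - fraction) ** n_clones


def expected_copies(fraction: float, n_clones: int) -> float:
    """Expected number of clones that carry the transcript."""
    return fraction * n_clones


# Illustrative (assumed) per-transcript fractions of total mRNA.
ABUNDANT = 0.01   # one highly abundant transcript, assumed at 1% of mRNA
RARE = 1e-6       # assumed fraction for a single low-abundance transcript

if __name__ == "__main__":
    n = 10_000
    print("expected clones of the abundant transcript:", expected_copies(ABUNDANT, n))
    print("chance of seeing the rare transcript at all:", prob_sampled(RARE, n))
```

Under these assumptions the abundant transcript is expected to be sequenced many times over, while the rare transcript will most likely not be observed at all.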

Several normalization and subtractive hybridization protocols have been developed to 
help overcome this problem. These techniques can be technically difficult to perform, and they 
can fail to detect cDNAs corresponding to rare transcripts. 

Summary of the Invention 

The invention is based in part on the discovery of novel procedures for equalizing, or 
normalizing, the representation of nucleic acids in a sample of nucleic acids in which different 
nucleic acids are initially present in the sample in unequal amounts. 

Accordingly, in one aspect the invention provides a method of screening a population of 
nucleic acid sequences. The method includes providing a population of nucleic acid sequences, 
partitioning the population into one or more subpopulations of nucleic acids, and identifying a 
first nucleic acid sequence having an increased level in the subpopulation relative to its level in
the starting population of nucleic acids. The first nucleic acid is then compared to a reference
nucleic acid sequence or sequences. The absence of the first nucleic acid sequence in the 
reference nucleic acid or nucleic acid sequences indicates the first nucleic acid is a novel nucleic 
acid sequence. 

The RNA can be derived from a plant, a single-celled animal, a multi-cellular animal, a
bacterium, a virus, a fungus, or a yeast. If desired, the RNA can also be partitioned prior to
synthesizing cDNA.

Among the advantages of the methods are that they eliminate, or minimize, redundant 
identification and characterization of identical nucleic acid sequences in a population of nucleic 
acids.

In some embodiments, the cDNA is synthesized to selectively generate cDNA species 
that are enriched for those sequences oriented towards the 5'-terminus of the cDNA. In other 
embodiments, the cDNA is synthesized to enrich for those sequences oriented towards the 
3'-terminus of the cDNA.

In some embodiments, the population is normalized by digesting the cDNAs with one or
more restriction endonucleases, in different reaction vessels, so as to generate segregated
multiple partitions. Preferably, each specific digested cDNA fragment will occur in only one
partition.

In some embodiments, the cDNAs are partitioned by physical methods, which may
optionally follow the restriction endonuclease digestion. The physical methods separate the
cDNAs as a function of their terminal nucleotide sequences, overall length and migratory pattern on
a sizing matrix that possesses the ability to separate molecules as a function of their physical
and/or biochemical properties.

In other embodiments, the cDNAs are partitioned during subsequent PCR-based
amplification of adapter-ligated cDNA fragments that have been digested with one or more
restriction endonucleases.

In other embodiments, the cDNAs are partitioned by screening the original mixture of
cDNAs so as to remove those sequences that have already been characterized. Screening occurs
using partitioned subtraction, whereby the original cDNAs are brought into contact with a
prepared subtraction library of known sequence in such a way that any sequence contained
within the original library that is complementary to any element of the subtraction library is
removed or suppressed.

cDNA sequences may also be partitioned by determining the size of each cDNA fragment
prior to sequencing, or by biasing for the formation of larger-fragment PCR products by lariat
formation. In the latter method, a bias for the larger fragments within the PCR reaction is
introduced to allow efficient preferential amplification of longer fragments. Alternatively,
partitioning may occur by preferentially amplifying 5' terminal or 3' terminal sequences of mRNA
molecules.

If desired, the amplified cDNAs may be fractionated by separating the amplified cDNAs on a
sizing matrix that separates molecules as a function of their physical and/or biochemical
properties and excising individual cDNA fragments from said sizing matrix. The excised cDNA
fragments are then inserted into a recombinant vector, or further amplified.

In some embodiments, the restriction endonuclease is a restriction endonuclease that 
possesses a recognition sequence 4 to 8 basepairs in length and produces either a 5'- or 3'-terminal
overhang 0 to 6 basepairs in length. 

In some embodiments, the identified sequence is subjected to computational analysis.
The computational analysis can include querying, or searching, a nucleotide sequence database to 
identify sequences that match, or the absence of any sequences that match. The database 
includes a plurality of known nucleotide sequences of nucleic acids that may be present in the 
sample. 


Preferably, the nucleic acid database comprises substantially all the known, expressed 
nucleic acid sequences derived from a group comprising a plant, a single-celled animal, a multi- 
cellular animal, a bacterium, a virus, a fungus, or a yeast. 

In some embodiments, sizing includes diluting and re-amplifying the cDNAs,
fractionating the re-amplified cDNAs by use of one or more sizing matrixes that separate the
molecules as a function of their physical and/or biochemical characteristics, and physically dividing
or cutting the sizing matrixes into a plurality of sections, wherein each section is comprised of
one or more cDNAs of similar molecular weight or size. The cDNAs are eluted from each of the
sizing matrix sections, ligated into a cloning vector and transformed into a host, e.g., a bacterial
host. A plurality of the transformed host colonies are selected so as to ensure a statistically
accurate representation of the cDNAs originally contained within the sizing matrix sections. The
inserts from this plurality of colonies are recovered and their molecular weights or sizes are
determined. A plurality of insert DNAs is identified, wherein each successive insert has a
molecular weight or size that is within a 0.2 basepair window, and only those DNA species that
fall within the 0.2 basepair window are subsequently subjected to nucleotide sequencing.

As utilized herein, the term "normalized" is defined as a mixture of mRNAs (or cDNAs
thereof) in which the copy number of a highly abundant mRNA species is reduced relative to its
copy number in a starting population of nucleic acids, and the copy number of a less abundant
mRNA species has been enriched relative to the copy number of the latter mRNA in the starting
population.

Among the advantages provided by the present invention is that multiple partitioning
strategies function in a synergistic manner so as to reduce unnecessary, redundant sequencing
of the same sequence(s), while concomitantly enhancing the sequencing of rarer sequences.

The partition strategies disclosed herein also normalize cDNA abundance by separating
the cDNA sequences into multiple partitions possessing minimal sequence overlap. In addition,
the various partitioning strategies are performed so as to assure that substantially all cDNAs are
sampled. An additional normalization effect may be obtained by separating the resulting DNA
fragments based upon their overall size (i.e., size fractionation). Moreover, it is also possible to
normalize the abundance of the cDNAs to an even greater degree by the use of one of several
disclosed pre-characterization methods.

All technical and scientific terms used herein have the same meanings commonly
understood by one of ordinary skill in the art to which this invention belongs. Although any
methods and materials similar or equivalent to those described herein can be used in the practice
of the present invention, the preferred methods and materials are now described. The citation or
identification of any reference within this application shall not be construed as an admission that
such reference is available as prior art to the present invention. All publications mentioned
herein are incorporated herein in their entirety by reference.

Brief Description of the Drawings 

FIG. 1 is a flow diagram illustrating a method for normalizing the abundance of nucleic 
acid molecules in a population of nucleic acid molecules. 

FIG. 2 is a flow diagram illustrating a method of 5'-enriched cDNA synthesis according
to the invention.

FIG. 3A is a schematic diagram showing restriction enzyme digestion and adapter
ligation for enrichment of 5' ends of mRNA molecules.

FIG. 3B is a histogram showing the regions of genes covered by clones constructed using
5' end enrichment.

FIG. 3C is a schematic diagram showing restriction enzyme digestion and adapter 
ligation for enrichment of mRNA molecules containing internal restriction fragments. 

FIG. 3D is a histogram showing the regions of genes covered by clones constructed 
using enrichment for internal restriction fragments. 

FIGS. 4A and 4B are schematic illustrations showing the effects of partitioning on the 
types of nucleic acids recovered in relation to the abundance of the mRNA molecules. 


Detailed Description of the Invention 

The present invention provides methods for identifying nucleic acids in a population of 
nucleic acid samples. It is based in part on normalizing the representation of sequences that may
be initially present at different levels in the population of nucleic acid sequences. The
normalization takes place by one or more methods of partitioning the nucleic acid population. 

A schematized overview of the invention is shown in FIG. 1. At the input step (100), a
starting population of RNA is chosen for analysis. Unless indicated otherwise, reference to a 
given RNA or population of RNAs is understood to also encompass reference to the 
corresponding cDNA or cDNAs. 

Any population of RNA molecules can be used as long as the population contains, or is 
suspected of containing, two or more distinct RNA molecules. The population can be isolated 
from a starting sample using standard methods for isolating RNA. The RNA population can be 
isolated from, e.g., an entire organism or multiple organisms, or from a tissue or cell of an
organism. The RNA can also be isolated from, e.g., cultured cells, such as eukaryotic or
prokaryotic cells grown in vitro. If desired, the RNA can be mRNA (e.g., polyA+ RNA), or
stable RNAs (e.g., ribosomal RNA, transfer RNA, or small nuclear RNA). The input RNA or
cDNA can be a subpopulation containing the 5' end of RNA molecules (110), a subpopulation
having internal regions of starting RNA molecules (112), or a subpopulation containing the 3'
end of the cDNA molecules (114).

The selected population or subpopulation is next subjected to a normalization analysis
(200). The normalization analysis includes one or more partitioning steps that decrease the
relative amount of sequences that are abundant in the starting population of nucleic acids and
increase the relative representation of sequences that are rare in the starting population of nucleic
acids. A partitioning step can take place before or after mRNA is converted to cDNA. A
partitioning step can also take place following amplification of a cDNA. Unless stated
otherwise, any partitioning method described herein can be used in conjunction with one or more 
additional partitioning methods. Examples of suitable partitioning steps are provided below. 

In some embodiments, cDNA molecules are subjected to digestion with restriction
enzymes, after which adapter oligonucleotides are ligated to the digestion products, and the
resulting products amplified. FIG. 1 indicates two types of digestions and adapter ligations
which can be performed. The first, designated short chemistry (216) because it tends to result in
shorter amplification products, uses two restriction enzymes, followed by ligation of adapter
oligonucleotides having termini complementary to the termini of the internal digestion
fragments. The second, designated long chemistry (218), similarly uses restriction digestion and
adapter ligation but uses longer adapters, which generally result in longer amplification products.

FIG. 1 also illustrates that the modified cDNAs can be subjected to size fractionation
(220), which is an example of a partitioning method, and that information from the size fraction
analysis can be used in a precharacterization analysis (222). A precharacterization can include,
e.g., comparing the size of the insert to sequence databases of fragment sizes produced by the
restriction enzyme. Amplification of short and long chemistry fragments can also be performed
in association with partitioning steps, which are explained in detail below.
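
The size-based precharacterization described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function names, the toy database, the size tolerance, and the choice of BamHI (recognition site GGATCC, cut after the first G) are all assumptions made for the example, and adapter lengths added during ligation are ignored.

```python
import re


def fragment_sizes(seq: str, site: str, cut_offset: int) -> list[int]:
    """Sizes of the fragments from a complete digest of a linear sequence.
    `cut_offset` is the top-strand cut position within the recognition site."""
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def candidate_genes(observed_size: int, database: dict[str, str],
                    site: str, cut_offset: int, tolerance: int = 2) -> list[str]:
    """Names of database entries predicted to yield a fragment whose size is
    within `tolerance` bp of the observed insert size."""
    hits = []
    for name, seq in database.items():
        sizes = fragment_sizes(seq, site, cut_offset)
        if any(abs(size - observed_size) <= tolerance for size in sizes):
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Hypothetical two-entry database of known sequences.
    toy_db = {
        "geneA": "AAAAGGATCCTTTTT",
        "geneB": "CCCCCCCCGGATCCAA",
    }
    print(candidate_genes(10, toy_db, "GGATCC", 1, tolerance=0))
```

An insert whose size matches no predicted fragment is a candidate for sequencing, whereas an insert matching a known fragment size may be deprioritized.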

The amplified products are next sequenced (300). Sequencing can be performed by any
method known in the art. The compiled sequence data are then assembled (400), and the
sequence generated is compared to known sequences, e.g., sequences in publicly available
databases.

The methods herein described are therefore useful for identifying genes, e.g., expressed
genes in an organism of interest, e.g., a human. The sequence information obtained is
particularly useful for identifying genes transcribed at low levels, or generating low levels of
steady state transcripts. The methods can also be used, e.g., to identify secreted proteins for
potential therapeutic use and/or for drug targets; identify variations within the human genome,
such as single nucleotide polymorphisms (SNPs); identify differences between normal and
diseased tissue; and analyze differential gene expression in different tissues and/or species.

Partitioning prior to cDNA synthesis

One approach to normalize levels of mRNA from a given sample, e.g., a given cell or
tissue type, is to arbitrarily separate a starting population of RNA molecules into many smaller
subpopulations, or collections. In general, a greater number of partitions increases the likelihood
that a given partition will lack a sequence or sequences that are abundant in the starting
population of nucleic acid sequences. This method therefore allows for access to sequences that
are expressed in very low copy number.


Alternatively, RNA populations can be isolated from different cell types. This
partitioning strategy is based on the premise that different tissues tend to express different
subsets of genes. Thus, RNA sequences can be partitioned by sequencing multiple different
cDNA libraries extracted from one or more tissues within the body. However, the partitioning
will not typically be complete, because many genes are expressed in more than one tissue type.

Synthesis and Amplification of cDNA molecules

Typically, partitioning is performed on cDNA populations that have been modified for
subsequent analysis. The modifications may include: (i) digesting the cDNA with at least one
restriction endonuclease; (ii) ligating an adapter oligonucleotide to one or more
termini of the digestion products; and (iii) amplifying the ligated products, e.g., in PCR-mediated
amplification. These methods are particularly suited to cDNA molecules that have been
constructed from the 5', internal, and 3' subpopulations of RNA molecules as described above.
These manipulations are collectively known as SeqCalling™ chemistry. In preferred
embodiments, cDNA is generated from populations of RNA molecules that have been divided
into subpopulations containing 5' ends of transcripts, subpopulations containing
internal regions of RNA molecules, or subpopulations containing 3' ends of RNA molecules.


A. Construction and amplification of cDNA subpopulations enriched for the
5' ends of mRNA molecules

5'-enriched cDNA synthesis generates cDNA species that are enriched for those
sequences oriented towards the 5'-terminus of the cDNA, and in which a specific oligonucleotide
sequence is ligated to the 5'-terminus. Approaches for generating cDNAs specifically enriched in
transcript 5' ends are often based on the synthesis of a homopolymeric (e.g., dG or dA) tail by the
enzyme terminal deoxynucleotidyl transferase (TdT) subsequent to the synthesis of the first
cDNA strand. Second strand synthesis is then primed by the use of a complementary homo-
oligonucleotide primer sequence. See e.g., Frohman, et al., 1988. Proc. Natl. Acad. Sci. USA 85:
8998-9002; Delort, et al., 1989. Nucl. Acids Res. 17: 6439-6448; Loh, et al., 1989. Science 243:
217-220; Belyavsky, et al., 1989. Nucl. Acids Res. 17: 2919-2932; Ohara, et al., 1989. Proc.
Natl. Acad. Sci. USA 86: 5673-5677.

Alternatively, amplification can exploit the 5'-terminal cap structure present in eukaryotic
mRNAs (see e.g., Furuichi & Miura, 1975. Nature 253: 374-375; Banerjee, 1980. Microbiol.
Rev. 44: 175-205; Shatkin, 1985. Cell 40: 223-224). However, mRNA preparations generally
include a mixture of both capped and non-capped mRNA species. The non-capped mRNAs are
thought to be primarily the result of degradation within the cell or during the isolation procedure.
An alternative approach to enrich for full-length mRNAs is to purify capped mRNA using
affinity reagents. These reagents include naturally occurring proteins that bind the cap structure
(see e.g., Edery, et al., 1995. Mol. Cell. Biol. 15: 3363-3371); anti-cap antibodies (see e.g.,
Bochnig, et al., 1987. Eur J Biochem. 68: 460-467); and chemical modification of the cap,
followed by selection for the modified cap structure (see e.g., Carninci, et al., 1996. Genomics
37: 327-336). In addition, 5'-oligo capping can also be used, in which specific oligonucleotide
sequences are selectively added to 5'-capped mRNAs prior to first strand cDNA synthesis.
Subsequent synthesis of the second strand is primed by an oligonucleotide that is
complementary to the modified cap sequence. See e.g., Maruyama & Sugano, 1994. Gene 138:
171-174; Suzuki, et al., 1997. Gene 200: 149-156; Fromont-Racine, et al., 1993. Nucl. Acids Res.
21: 1683-1684; U.S. Patent No. 5,597,713.

An alternative method for isolating RNA molecules containing a capped 5' end is shown
in FIG. 2. FIG. 2 depicts a flow diagram for 5'-enriched cDNA synthesis using a full-length
mRNA having a 5'-terminal cap sequence (Gppp) and a poly A+ tail. Also shown in FIG. 2 is
truncated mRNA having a 5' terminal phosphate group. Typically, RNA preparations contain a
mixture of full-length capped RNAs and truncated mRNAs. The truncated RNAs can arise, e.g.,
by intracellular degradation of the RNA or by degradation of the RNA during its isolation.

In the first step in FIG. 2, the free 5'-terminal phosphate groups of the truncated or
degraded mRNAs are removed by the action of a phosphatase, e.g., the bacterial alkaline
phosphatase shown, or calf intestinal alkaline phosphatase. The phosphatase is then inactivated.
In the second step, the 5' cap is removed from the full-length mRNA using a pyrophosphatase,
e.g., the tobacco acid pyrophosphatase shown in FIG. 2. The resulting product is the decapped
full-length RNA with a free 5'-terminal phosphate group.

In the third step in FIG. 2, the phosphate group serves as a substrate for an RNA ligase-
mediated reaction that attaches a specific DNA/RNA hybrid to the 5'-terminus of the full-length
mRNAs. An RNA containing the ligated hybrid is used as a substrate for first and second strand
cDNA synthesis. Preferably, a combination of oligo(dT)- and random hexamer-mediated first
strand priming is performed in the presence of E. coli ligase to enhance overall cDNA length.
Preferably, an RNase and thermal cycling are used to remove the RNA strand after first strand
synthesis. The resulting single strand DNA (ssDNA) functions as a more effective reagent for
the priming of second strand synthesis.

Although first strand synthesis occurs for both types of mRNA species (i.e., full-length
and truncated/degraded), only those mRNAs with the appropriate sequence ligated to the 5'-
terminus (i.e., full-length mRNAs) contain a priming site for subsequent second strand synthesis.
Thus, RNAs derived from the full-length mRNAs are selectively amplified.

Preferably, a thermostable enzyme for second strand synthesis in a non-thermal cycled 
temperature profile is used to ensure more stringent priming of the second strand reaction 
compared to a non-thermostable enzyme. 

A double-stranded cDNA prepared with an adapter containing an oligonucleotide
sequence (nR plus "signature sequence") ligated to the 5'-terminus is digested with a restriction
endonuclease as shown in FIG. 3A. The oligonucleotide RS [SEQ ID NO:1] (or nR) is used to
prime the PCR amplification step subsequent to the ligation of the restriction digestion products.
The nJ/nJ PCR product is shown as lined-through to denote that it does not clone efficiently in E.
coli.

A representation of the distribution of clones derived using 5' enriched synthesis with
respect to the region of the gene they include is shown in FIG. 3B. A reference mRNA
containing a 5' terminus, an ATG initiation codon, a Stop codon, and a 3' terminus is shown
along the X-axis. Also shown is a histogram showing the number of clones (Y-axis) containing
sequences derived from the indicated regions of the reference mRNA. The histogram reveals that
the 5' enrichment method generates distributions enriched in 5' end fragments, and has
increased proportions of fragments containing the start codon and the adjacent 90 bp of coding
sequences.

B. Construction and amplification of cDNA subpopulations enriched for the
interior regions of RNA molecules

To generate relatively short cDNA fragments from the interior regions of an
RNA molecule, i.e., from a region not containing the 5' or 3' terminus, the following procedure
is used.

RNA is purified using any standard procedure (see e.g., Berger, 1987. Methods Enzymol.
152: 215-219) and cDNA is synthesized according to standard protocols, such as random
oligomer or oligo-dT primed synthesis (see, e.g., Gubler & Hoffman, 1983, Gene 25: 263-269;
Okayama & Berg, 1982, Mol. Cell Biol. 2: 161-170).

The cDNA is initially digested with a pair of restriction endonucleases. Although any
enzyme pair that generates distinct 5'-terminus overhangs is acceptable, a preferred embodiment
utilizes enzymes that possess a 4-8 basepair (bp) recognition site yielding a 0-6 bp 5'-terminal
overhang, and a more preferred embodiment utilizes enzymes that possess a 6 bp recognition
sequence and generate a 4 bp 5'-terminus overhang. One form of manipulation for generating
internal fragments is shown in FIG. 3C. The cDNAs are digested with two restriction
endonucleases, yielding three types of fragments (two "homo", one "hetero" termini). Following
digestion, specific adapters are ligated and the fragments are PCR amplified based upon the
specific adapter sequence utilized. As indicated by the crossed lines, the nR--nR and nJ--nJ
fragments are unstable in E. coli, and are rarely observed following cloning.
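
The three fragment types produced by a two-enzyme digest can be illustrated with a short in silico digest. This is a sketch under stated assumptions: BamHI (G^GATCC) and HindIII (A^AGCTT) are chosen here only because each has a 6 bp recognition site and leaves a 4 bp 5' overhang; the text does not name the enzymes, and the function names and test sequence are hypothetical.

```python
import re

# Assumed example enzymes: name -> (recognition site, top-strand cut offset).
ENZYMES = {"BamHI": ("GGATCC", 1), "HindIII": ("AAGCTT", 1)}


def cut_positions(seq: str, site: str, offset: int) -> list[int]:
    """Top-strand cut coordinates for every occurrence of `site` in `seq`."""
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def internal_fragments(seq: str, enzymes: dict) -> list[tuple[str, str, str]]:
    """Fragments bounded on both sides by an enzyme cut, returned as
    (left enzyme, right enzyme, fragment sequence)."""
    cuts = sorted(
        (pos, name)
        for name, (site, offset) in enzymes.items()
        for pos in cut_positions(seq, site, offset)
    )
    return [(e1, e2, seq[p1:p2]) for (p1, e1), (p2, e2) in zip(cuts, cuts[1:])]


def terminus_type(left: str, right: str) -> str:
    """'hetero' if the two ends were cut by different enzymes, else 'homo'."""
    return "hetero" if left != right else "homo"


if __name__ == "__main__":
    cdna = "TTGGATCCACGTAAGCTTCCAAGCTTGGGGATCCAA"  # hypothetical cDNA
    for left, right, frag in internal_fragments(cdna, ENZYMES):
        print(left, right, terminus_type(left, right), len(frag))
```

Only the "hetero" fragments carry one overhang of each type, so only they receive two different adapters after ligation.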

Two suitable 24 nucleotide adapter molecules can be generated from RA24 [SEQ ID
NO:9]; RC24 [SEQ ID NO:10]; JA24 [SEQ ID NO:11]; or JC24 [SEQ ID NO:12]. The adapters
are generated by annealing the RA24, RC24, JA24 or JC24 24-mer oligonucleotides [SEQ ID
NOs:9-12, respectively] with 12-mer oligonucleotides possessing sequences that are
complementary to the last 8 nt of the 3'-terminus of the 24-mer and the 4 bp overhang. The
sequences of these primers and other primers described herein are provided in Table 1.

These 4 bp overhang sequences are chosen so as to be complementary to the overhangs
that are generated by the restriction endonuclease digestions. In addition, the last 3'-terminal
nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction
endonuclease recognition site is not re-generated when the adapter anneals to the digested cDNA.
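
The choice of the adapter's last 3'-terminal nucleotide can be checked mechanically. The sketch below assumes an enzyme that leaves a 5' overhang, so that after ligation the top strand reads as the adapter's 3' end followed by the portion of the recognition site retained on the fragment. The adapter sequences used are placeholders (the actual 24-mer sequences are those of Table 1), and BamHI and HindIII are assumed examples.

```python
def regenerates_site(adapter: str, site: str, cut_offset: int) -> bool:
    """True if ligating `adapter` (top strand, 5'->3') to a fragment end left
    by cutting `site` at `cut_offset` recreates the full recognition site."""
    adapter_tail = adapter[len(adapter) - cut_offset:]
    junction = adapter_tail + site[cut_offset:]
    return junction == site


def safe_terminal_bases(site: str, cut_offset: int,
                        choices: tuple[str, ...] = ("A", "C")) -> list[str]:
    """Candidate 3'-terminal adapter bases that do NOT regenerate the site.
    The 23 leading 'N' characters stand in for the rest of a 24-mer."""
    return [
        base for base in choices
        if not regenerates_site("N" * 23 + base, site, cut_offset)
    ]


if __name__ == "__main__":
    print(safe_terminal_bases("GGATCC", 1))  # BamHI, G^GATCC
    print(safe_terminal_bases("AAGCTT", 1))  # HindIII, A^AGCTT
```

Under this model, either A or C is acceptable for a site beginning with G, whereas a site beginning with A requires the C-terminated adapter; this is consistent with providing both an "A" and a "C" version of each adapter.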

WO 00/40757 PCT/US00/00402

Following ligation of the adapters, the restriction endonucleases are heat-inactivated, and 
the reaction mixture is PCR amplified. 

Internal fragments may alternatively be generated using a second type of adapter, which
results in longer amplified fragments (also referred to as "Long Internal Chemistry" or "Long
Chemistry"). This method is similar to short chemistry, except all adapters possess an
additional common sequence on their 5'-termini. This technique suppresses the amplification of
small fragments while concomitantly increasing the amplification of longer fragments. The
subsequent PCR amplification with the "X" and "J" primers results in production of both a
hetero (i.e., "RX--JR") adapter fragment and "homo" adapter fragments (i.e., "RX--XR" and
"RJ--JR"), which are unstable in a host and are rarely observed following the cloning process.

The effectiveness of enriching for internal fragments is shown in FIG. 3D. Several
thousand sequences were generated from internal cDNA fragments and compared against a database of
approximately 5000 known genes with annotated start and stop sites. Each sequence matching
the database was assigned a location on the gene relative to the start (0.0) and stop (1.0) locations,
according to the location of the 5'-most matching nucleotide (of the gene). The distribution from a
standard run shows that most fragments are located internally (i.e., within the coding region).
Fragments covering the start codon plus an additional 90 bp (located immediately 3' of the start
codon) are significant, because they have a high probability of containing enough sequence to
identify secreted proteins. A small but significant fraction of the fragments covers the start
codon and the additional 90 bp.
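
The location assignment described above amounts to a linear rescaling of gene coordinates. The sketch below assumes 0-based coordinates in which `start` is the first base of the start codon and `stop` is the annotated stop location; the function names and the half-open interval convention are assumptions made for illustration.

```python
def relative_location(match_5prime: int, start: int, stop: int) -> float:
    """Position of the 5'-most matching nucleotide on a scale where the
    annotated start site maps to 0.0 and the stop site maps to 1.0.
    Values below 0.0 fall in the 5' untranslated region."""
    return (match_5prime - start) / (stop - start)


def covers_start_plus(match_begin: int, match_end: int,
                      start: int, extra_bp: int = 90) -> bool:
    """True if the matched interval [match_begin, match_end) spans the start
    codon and the `extra_bp` bases located immediately 3' of it."""
    return match_begin <= start and match_end >= start + 3 + extra_bp


if __name__ == "__main__":
    # Hypothetical gene with start codon at 100 and stop location at 1100.
    print(relative_location(600, start=100, stop=1100))
    print(covers_start_plus(80, 200, start=100))
```

Binning the values returned by `relative_location` over all matching sequences would give a histogram of the kind described for FIG. 3D.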

Following digestion, adapters are ligated to these 5'-terminal overhangs. The primers are
longer relative to primers used to generate short fragments. Two specific pairs of adapter
molecules that can be used in long chemistry synthesis include RXC [SEQ ID NO:2]; RXA
[SEQ ID NO:3]; RJC [SEQ ID NO:4]; or RJA [SEQ ID NO:5]. The adapters are generated by
annealing RXC, RXA, RJC or RJA oligonucleotides [SEQ ID NOs:2-5, respectively] with 12-
mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the
3'-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so
as to be complementary to the overhangs that are generated by the restriction endonuclease
digestions. In addition, the last 3'-terminal nucleotide of the 24-mer adapter (i.e., A or C) is
selected such that a functional restriction endonuclease recognition site is not re-generated when
the adapter anneals to the digested cDNA.

Following the ligation of the adapters, the restriction endonucleases are heat inactivated
and the reaction mixture is PCR amplified. While the sequences of the two adapters are distinct,
they nevertheless possess common 5' sequences that allow the formation of lariat or pan-handle
structures that function to suppress PCR-mediated amplification of the shorter fragments.
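
The structural basis of the pan-handle can be illustrated by checking a single strand for an inverted terminal repeat. In the sketch below, all sequences are hypothetical placeholders rather than the adapters of Table 1. The code only measures the length of the potential stem; it does not model the annealing kinetics by which short fragments are preferentially suppressed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq[::-1].translate(_COMPLEMENT)


def panhandle_stem(strand: str, max_len: int = 30) -> int:
    """Length of the longest 5' prefix that is the reverse complement of the
    3' suffix, i.e. the stem that could form if the two ends self-anneal."""
    stem = 0
    for n in range(1, min(max_len, len(strand) // 2) + 1):
        if strand[:n] != revcomp(strand[-n:]):
            break
        stem = n
    return stem


if __name__ == "__main__":
    common = "ACGTTGCAAG"        # hypothetical common 5' sequence
    x_specific = "GGTCAT"        # hypothetical "X" adapter-specific part
    j_specific = "TCAGGA"        # hypothetical "J" adapter-specific part
    insert = "ATATATCCGG"        # hypothetical cDNA fragment

    # Top strand after ligating long adapters to both ends of the fragment.
    long_strand = (common + x_specific + insert
                   + revcomp(j_specific) + revcomp(common))
    # Same fragment with adapters lacking the common sequence.
    short_strand = x_specific + insert + revcomp(j_specific)

    print(panhandle_stem(long_strand), panhandle_stem(short_strand))
```

With the common sequence present on both adapters, every amplified strand carries complementary ends; the shorter the fragment, the closer those ends are, which is the rationale given above for suppression of short products.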

C. cDNA synthesis of molecules enriched for 3' ends

3'-enriched cDNA synthesis generates cDNAs that are enriched for the sequences
oriented towards the 3'-terminus of the cDNA. This is accomplished by synthesis of the first-
strand using a specific oligonucleotide sequence that has been modified to contain an adapter
sequence at its 5'-terminus [SEQ ID NO:14]. Following first-strand cDNA synthesis with the
primer, standard cDNA synthesis protocols are utilized as illustrated in FIG. 2.

The 3'-enriched cDNA is digested with one restriction endonuclease. Although any
enzyme that generates a distinct 5'-terminus overhang is acceptable, it is generally most preferred
to utilize an enzyme that possesses a 6 bp recognition site yielding a 4 bp 5'-terminal overhang.
Following digestion, an adapter is then ligated to these 5'-terminal overhangs. These adapters are
generated from the JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with
12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the
3'-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so
as to be complementary to the overhangs that are generated by the restriction endonuclease
digestions. In addition, the last 3'-terminal nucleotide of the 24-mer adapter (i.e., A or C) is
selected such that a functional restriction endonuclease recognition site is not re-generated when
the adapter anneals to the digested cDNA.

Following the ligation of the adapters, the restriction endonucleases are heat inactivated 
and the reaction mixture is PCR amplified. 

Longer fragments enriched for the 3'-ends can be obtained by ligating a longer primer to
cDNA molecules that have been digested with a restriction enzyme. Any enzyme that generates
a distinct 5'-terminus overhang can be used. It is generally preferred to utilize an enzyme that
possesses a 6 bp recognition site yielding a 4 bp 5'-terminal overhang. Following digestion, an
adapter is then ligated to the 5'-terminal overhangs. Acceptable adapters are generated from the
JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with 12-mer oligonucleotides
possessing sequences that are complementary to the last 8 nt of the 3'-terminus of the 24-mer and
the 4 bp overhang. These 4 bp overhang sequences are chosen so as to be complementary to the
overhangs that are generated by the restriction endonuclease digestion. In addition, the last 3'-
terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional
restriction endonuclease recognition site is not regenerated when the adapter anneals to the
digested cDNA.

While the sequences of the two adapters are distinct, they possess common 5' sequences 
that allow the formation of structures that suppress PCR-mediated amplification of the shorter 
fragments. 

Following the ligation of the adapters, the restriction endonucleases are heat inactivated 
and the reaction mixture is PCR amplified. 


The cDNA fragments prepared as above can be size-fractionated, e.g., by electrophoretic
fractionation on agarose or polyacrylamide gels, or other types of gels comprised of a similar
material. The cDNA fragments may then be physically excised in defined size ranges (i.e., as
identified by size markers) and recovered from the excised gel fragments. Additionally, if the
quantities of isolated cDNA fragments are low, they can be amplified, e.g., by PCR amplification.
For example, if the cDNA fragments are generated by Long Internal SeqCalling™ Chemistry
protocol, they are amplified with J23 [SEQ ID NO: 6] and X22 [SEQ ID NO: 15] primers (either
before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into E.
coli. Similarly, if the cDNA fragments are generated by Long 5' SeqCalling™ Chemistry
protocol, they can be amplified by J23 [SEQ ID NO: 6] and RS [SEQ ID NO: 1] oligonucleotides
(either before or after fractionation) prior to cloning, as these products cannot be efficiently
cloned into E. coli.

When PCR amplification is used to amplify fragments, conditions are preferentially
chosen to minimize non-productive hybridization events. It has been observed that DNA
re-hybridization during the PCR amplification process (designated the "Cot effect"; see e.g.,
Mathieu-Daude, et al., 1996. Nucl. Acids Res. 24: 2080-2084) can inhibit amplification. This
effect is particularly evident during later PCR amplification cycles, when a substantial
concentration of the amplified product has accumulated and the primer concentration has been
depleted. As a result, amplification in the later PCR cycles typically follows non-linear dynamics.
By manipulating PCR amplification reaction conditions, it is possible to markedly
enhance the "Cot effect" by inserting a slow-annealing step between the denaturation
and re-naturation steps in each PCR amplification cycle. The slow-annealing temperature is
chosen so as to be above the primer-template melting temperature (Tm), but at or below
the template-template Tm, thus favoring template-template annealing over template-primer
annealing. For example, a decrease in temperature from 85°C to 75°C at a gradient of
10°C/minute can be utilized.
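
As a rough illustration of the slow-annealing step, the sketch below computes how long the example ramp (85°C to 75°C at 10°C/minute) takes and checks that the whole ramp stays above a primer-template Tm, so that primers cannot anneal during it. The Tm values passed in are hypothetical; the text gives no specific Tm.

```python
# Sketch only: the primer Tm values used in the example calls are hypothetical.

def ramp_minutes(start_c: float, end_c: float, rate_c_per_min: float) -> float:
    """Duration of a linear temperature ramp, in minutes."""
    if rate_c_per_min <= 0:
        raise ValueError("ramp rate must be positive")
    return abs(start_c - end_c) / rate_c_per_min


def ramp_excludes_primer_annealing(start_c: float, end_c: float,
                                   primer_tm_c: float) -> bool:
    """True if every temperature on the ramp is above the primer-template Tm."""
    return min(start_c, end_c) > primer_tm_c


if __name__ == "__main__":
    print(ramp_minutes(85, 75, 10))                    # length of the example ramp
    print(ramp_excludes_primer_annealing(85, 75, 68))  # hypothetical primer Tm of 68°C
```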


Partitioning methods 

One or more of the following techniques, or combinations of these techniques, can be used
to normalize the abundance of RNA (or their cDNA counterpart) species within a given cell or
tissue sample.

(i) Partitioning by restriction endonuclease digestion

A cDNA library can be partitioned into many different sets of fragments by digestion
with different restriction enzyme pairs. Fragmentation of the same cDNA library with different
sets of restriction enzymes, in different reaction vessels, results in segregated multiple partitions,
i.e., each specific fragment will occur in only one partition. The digested fragments can be
analyzed further, e.g., by direct sequencing, cloning of the digested fragments or sequencing, or
one or more of these techniques.

If desired, the cDNA is digested into fragments of a length that is convenient for 
sequencing. Preferably, multiple different partitions, e.g., 10-100, 20-750, or 50-250 partitions 
are obtained. 
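
Partitioning by enzyme pairs can be sketched in silico as follows. The code digests each cDNA with every pair drawn from a small enzyme set and collects the internal fragments per pair. The three enzymes (EcoRI, BamHI, HindIII), the toy sequence, and the decision to keep only fragments bounded by two cut sites are assumptions for illustration, not details given in the text.

```python
# Sketch only: enzyme choices and the toy library are illustrative assumptions.
import re
from itertools import combinations

# Recognition site and top-strand cut offset within the site.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def cut_positions(seq, enzyme):
    """Top-strand cut coordinates for one enzyme."""
    site, offset = ENZYMES[enzyme]
    # A lookahead is used so that overlapping sites are all reported.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme_a, enzyme_b):
    """Fragments lying between two adjacent cut sites of either enzyme.

    Terminal pieces with only one cut end are dropped (an assumption).
    """
    cuts = sorted(set(cut_positions(seq, enzyme_a) + cut_positions(seq, enzyme_b)))
    return [seq[start:end] for start, end in zip(cuts, cuts[1:])]


def partition_library(library, enzymes):
    """Map each enzyme pair to the (cDNA name, fragment) tuples it yields."""
    return {
        pair: [
            (name, fragment)
            for name, seq in library.items()
            for fragment in double_digest(seq.upper(), *pair)
        ]
        for pair in combinations(enzymes, 2)
    }


if __name__ == "__main__":
    library = {"cDNA_1": "TTTGAATTCAAAAGGATCCTTT"}
    for pair, fragments in partition_library(library, list(ENZYMES)).items():
        print(pair, fragments)
```

In this toy case only the EcoRI/BamHI reaction yields an internal fragment, so the cDNA is represented in exactly one partition.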


(ii) Partitioning by fragment size or other physical property 

Partitioning can also be performed using other separation methods that separate DNA
molecules according to their physical characteristics. The methods can include, e.g., separation
based on physical and/or biochemical properties (i.e., molecular weight/size, terminal nucleotide
sequences, exact migratory pattern, and the like). Separation methods can include, e.g., gel
electrophoresis, including agarose or polyacrylamide gel electrophoresis, high pressure liquid
chromatography (HPLC), preparative-scale capillary electrophoresis, and similar methodologies.

In one embodiment, unique cDNAs that represent unique (i.e., not previously sequenced)
fragments are selected based on their presence in a characteristic restriction enzyme fragment. In
this process, a cDNA population is digested with restriction endonucleases, fractionated, and
fragments in a desired size range are recovered. The recovered fragments are then ligated to a
vector and transformed into an appropriate host, e.g., E. coli. Rather than being directly
sequenced following the selection process, the DNA fragments are isolated and separated, e.g.,
sized using one or more sizing matrixes that separate the molecules as a function of their physical
or biochemical properties. The embodiment is thus referred to as "clone sizing". Those
recombinant clones that have an insert with characteristics not present in a reference database are
determined to contain a unique DNA fragment. Preferably, only unique fragments are
subsequently sequenced.

For example, a DNA fragment that is sized in this way possesses two pieces of
information that serve as a unique identifier: (i) the identity of the restriction endonuclease used
to generate the fragment, and (ii) the size of the fragment. With these two pieces of information,
fragments are picked for subsequent nucleotide sequencing by searching for a specific fragment
within a 0.2 basepair window. If a fragment is present in the window, the E. coli clone
containing the fragment is re-arrayed on a liquid handling robot such as a Tecan Genesis or
Packard Multiprobe device, and sequenced. When multiple fragments are present within the 0.2
bp window, only one is selected to be sequenced. Thus, by use of this sizing filter, sequencing of
identical fragments is significantly lowered.

By sizing individual fragments and comparing the observed size to previously determined 
sequences, i.e., using a "sizing filter", only fragments of unique lengths need to be sequenced. 
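
The sizing filter can be sketched as a simple selection routine keyed on the two identifiers named above, the restriction enzyme(s) and the measured size. The record layout, the clone identifiers, and the reading of the 0.2 bp window as "two sizes within 0.2 bp of each other count as the same fragment" are assumptions for illustration.

```python
# Sketch only: data layout and window interpretation are assumptions.

def select_for_sequencing(clones, known_sizes, window_bp=0.2):
    """Pick one clone per (enzyme, size) window that is not already known.

    clones      -- iterable of (clone_id, enzyme_key, size_bp)
    known_sizes -- dict mapping enzyme_key to sizes already in the reference
    """
    seen = {key: list(sizes) for key, sizes in known_sizes.items()}
    selected = []
    for clone_id, enzyme_key, size_bp in clones:
        prior = seen.setdefault(enzyme_key, [])
        # Skip clones whose size falls in the window of a known or
        # already-selected fragment generated by the same enzyme(s).
        if any(abs(size_bp - s) <= window_bp for s in prior):
            continue
        prior.append(size_bp)
        selected.append(clone_id)
    return selected


if __name__ == "__main__":
    clones = [
        ("c1", "EcoRI/BamHI", 150.0),
        ("c2", "EcoRI/BamHI", 150.1),    # same window as c1
        ("c3", "EcoRI/BamHI", 212.4),
        ("c4", "EcoRI/HindIII", 150.0),  # same size, different enzymes
        ("c5", "EcoRI/BamHI", 300.0),    # already in the reference
    ]
    print(select_for_sequencing(clones, {"EcoRI/BamHI": [300.05]}))
```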

To pre-size large numbers of fragments, the fragments can be initially pooled as a
function of their expected size, so as to ensure that any fragment occurs in a minimum of
three individual pools.

Size fractionation may be accomplished in a number of ways. One commonly utilized
method is electrophoretic fractionation on agarose or polyacrylamide gels, or other types of gels
comprised of a similar material. The cDNA fragments may then be physically excised in defined
size ranges (i.e., as identified by size markers) and recovered from the excised gel fragments.
Additionally, if the quantities of isolated cDNA fragments are low, they can be PCR amplified at
this stage. For example, if the cDNA fragments are generated by Long Internal SeqCalling™
Chemistry protocol, described above, they must be amplified with J23 and X22 primers (either
before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into E.
coli. Similarly, if the cDNA fragments are generated by Long 5' SeqCalling™ Chemistry
protocol, described above, they must be amplified by J23 and RS oligonucleotides (either before
or after fractionation) prior to cloning, as these products cannot be efficiently cloned into E. coli.


(iii) Partitioning based on hybridization

Screening can be performed using a variety of methods that rely on hybridization
between a probe sequence or sequences and a cDNA library. Members of the library containing
a homologous sequence are then removed from the library. For example, a cDNA library can be
brought into contact with a prepared library of known sequence in such a way that any sequence
contained within the substrate library that is complementary to any element of the subtraction
library is removed or suppressed. This method obviates re-characterizing, e.g., re-sequencing,
already characterized members of the cDNA population.

(iv) Amplification-associated partitioning 

Partitioning can also be performed in association with amplification. In particular,
partitioning can be carried out during PCR amplification of the adapter-ligated cDNA fragments
described above. During PCR-mediated amplification of mixtures of cDNA fragments, short
fragments tend to be preferentially amplified relative to large fragments. PCR conditions can be
adjusted to favor the formation of larger fragments within the PCR reaction to allow efficient
preferential amplification of longer fragments.

Normally, two different primers are used in PCR amplification to prime the enzymatic
activity of the polymerase at each terminus of the target sequence. Conversely, if primers with
identical 5' sequences are used, there is a tendency for the fragments to form lariat or pan-handle
structures, due to intra-strand hybridization, which interferes with the amplification process.
Because the probability of the two ends of a polymer (i.e., cDNA fragment) finding one another
is inversely proportional to a fractional power of the polymer length, short fragments tend to
form these lariat structures more readily than do longer ones. Accordingly, this effect is
exploited in the amplification of long cDNA fragments. See U.S. Patent No. 5,565,340, whose
disclosure is incorporated herein by reference, in its entirety.
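
To give a feel for the length dependence, the sketch below compares the end-to-end encounter probability of a short and a long fragment under a power law, P(L) proportional to L^(-a). The text does not state the exponent; the default a = 1.5 is an assumption borrowed from the ideal-chain model of loop closure, and the fragment lengths are arbitrary.

```python
# Sketch only: the exponent 1.5 is an assumed ideal-chain value, not from the text.

def relative_loop_propensity(short_bp: float, long_bp: float,
                             exponent: float = 1.5) -> float:
    """How many times more readily the short fragment's ends meet.

    Uses P(L) ~ L**(-exponent), so the ratio P(short) / P(long) equals
    (long / short) ** exponent.
    """
    if short_bp <= 0 or long_bp <= 0:
        raise ValueError("fragment lengths must be positive")
    return (long_bp / short_bp) ** exponent


if __name__ == "__main__":
    # A 200 bp fragment versus an 800 bp fragment under the assumed exponent.
    print(relative_loop_propensity(200, 800))
```

Under this assumption a fourfold difference in length translates into roughly an eightfold difference in the tendency to form the suppressive structure, which is the bias that favors amplification of the longer fragments.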

Long fragment amplification can be enhanced using DNA fragments to which long adapter
sequences have been ligated as described above. Amplification is dependent upon a number of
factors that can alter the ratio of a linear adapter structure, which is permissive for amplification,
and a lariat-loop structure, which suppresses amplification. The equilibrium constant associated
with the formation of the suppressive and the permissive structures, and therefore the efficiency
of suppression of particular DNA fragments during PCR, is primarily a function of the following
factors: (i) differences in melting temperature of suppressive and permissive structures; (ii)
position of the primer sequence within the adapter; (iii) the length of the target DNA fragments;
(iv) PCR primer concentration; and (v) primary structure.

Analysis of partitioned cDNA molecules 

Partitioned cDNA molecules are next analyzed by comparing the sequences to a reference 
nucleic acid or nucleic acids. To facilitate analysis of partitioned cDNA molecules, they can, if 
not subcloned previously, be ligated into an appropriate vector and transformed into cells by any 
applicable method. 

The reference nucleic acid or nucleic acids can be any fragment for which sufficient 
information is available to unambiguously identify the partitioned cDNA molecule. The 
reference nucleic acid or nucleic acids can therefore be part of, e.g., sequence databases, or 
databases of other characteristics that unambiguously identify a nucleic acid. Examples of such 
characteristics include e.g., a compilation of fragment sizes associated with specific restriction 
enzymes for a particular gene. In some embodiments, partitioned nucleic acids will be 
sequenced. The partitioned sequences can be sequenced by any method known to the art and the 
resulting sequence data is analyzed by computer-based systems. 
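
A minimal sketch of the comparison step follows. It treats a fragment as novel if neither it nor its reverse complement occurs in any reference sequence. Exact substring matching is a deliberate simplification chosen for illustration; practical database searches tolerate mismatches and gaps, and the reference list here is a hypothetical stand-in for a sequence database.

```python
# Sketch only: exact matching and the toy reference list are simplifying assumptions.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_novel(fragment: str, reference_sequences) -> bool:
    """True if the fragment is absent from every reference, on either strand."""
    query = fragment.upper()
    query_rc = reverse_complement(query)
    for reference in reference_sequences:
        reference = reference.upper()
        if query in reference or query_rc in reference:
            return False
    return True


if __name__ == "__main__":
    references = ["AAACCCGGGTTTACGT"]
    for fragment in ("CCCGGG", "ACGTAAA", "GATTACA"):
        print(fragment, is_novel(fragment, references))
```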

Suitable databases include publicly available databases that comprehensively record all 
observed DNA sequences. Such databases include, e.g., GenBank from the National Center for 
Biotechnology Information (Bethesda, Md.), the EMBL Data Library at the European 
Bioinformatics Institute (Hinxton Hall, UK) and databases from the National Center for Genome 
Research (Santa Fe, N.Mex.). However, any database containing entries for the sequences likely 
to be present in such a sample to be analyzed is usable in the further steps of the computer 
methods. Methods of searching databases are described in detail in e.g., U.S. Patent No. 
5,871,697, whose disclosure is incorporated herein by reference, in its entirety.

Table 1 below summarizes the various primers and adapters disclosed herein.

Table 1

SEQ ID NO:   Name          Sequence (from 5' to 3')
1            RS            CTCTCCGATG CAGGTGGC
2            [illegible]   AGCACACTCC AGCCTCTCTC CGAGCACATG CGACACTGAG [illegible] TACTAC
3            KaA           AGCACACTCC AGCCTCTCTC CGAGCACATG CGACACTGAG [illegible] TACTAA
4            KJC           [illegible]
5            KJA           AGCACACTCC AGCCTCTCTC CGAACCGACG TCGAATATCC ATGCAGA
6            J23           ACCGACGTCG AATATCCATG CAG
7            R23           AGCACACTCC AGCCTCTCTC CGA
8            NR17          AGCACACTCC AGCCTCT
9            RA24          AGCACACTCC AGCCTCTCTC CGAA
10           RC24          AGCACACTCC AGCCTCTCTC CGAC
11           JA24          ACCGACGTCG AATATCCATG CAGA
12           JC24          ACCGACGTCG AATATCCATG CAGC
13           Dt-R          AGCACACTCC AGCCTCTCTC CGA
14           [illegible]   AGCACACTCC AGCCTCTCTC CGATTTTTTT TTT


EXAMPLES

The invention will be further described in the following examples, which do not limit the
scope of the invention described in the claims. Examples 1-6 collectively describe the synthesis
and amplification of cDNA subfractions enriched for the 5' terminal sequences of mRNA
molecules. Example 7 describes clone sizing.

Example 1. 5' cDNA Synthesis: Phosphatase/Pyrophosphatase Digestion

For each reaction, 2.5 µg mRNA (do not exceed 3 µg total) is added to H2O so as to
provide a total volume of 73.5 µl. This mixture is then heated to 65°C for 10 minutes, and
quick-cooled on ice. The CIAP Cocktail (see below) is made as follows:

CIAP Cocktail:

For each reaction:                              x 11
    10 µl     10x CIAP buffer                   110 µl
    2.5 µl    RNasin (Promega)                  27.5 µl
    10 µl     0.1 M DTT                         110 µl
    4 µl      0.01 U/µl CIAP*                   35 µl

1) 26.5 µl of the above enzyme mixture is added to each 3 µl mRNA to give
a total volume of 30.5 µl. 73.5 µl of the RNA mix is then added to give a
final volume of 100 µl.

2) Incubate at 37°C for 40 minutes.

3) Add 100 µl TE buffer (10 mM Tris pH 8.0; 0.1 mM EDTA).

4) Add 200 µl Acid-Phenol.

5) Mix vigorously.

6) Add 200 µl Chloroform-Isoamyl Alcohol (24:1 v/v).

7) Mix vigorously.

8) Centrifuge in a microfuge at maximum speed for 10 minutes.

9) Remove supernatant and transfer to new tube. Discard bottom layer.

10) Repeat steps 4-9 (only for CIAP treatment, not in later steps).

11) Add 2 µl ssDNA carrier and 20 µl 3 M Sodium Acetate to each tube.

12) Vortex 10 seconds and add 440 µl of absolute ethanol.

13) Vortex 10 seconds and incubate at least 30 minutes at -80°C.

14) Centrifuge samples at 13,200 x g for 15 minutes.

15) Wash nucleic acid pellets with 70% ethanol and air-dry pellet.

16) Dissolve nucleic acid pellet in 70 µl water and cool on ice.

17) Centrifuge for 10-15 seconds at maximum speed.

18) Transfer contents of tubes to 8-strip tubes.

19) Add 30 µl TAP cocktail (see below).


TAP Cocktail:

For each reaction:                              x 11
    10 µl     10x TAP buffer                    110 µl
    2.5 µl    RNasin                            27.5 µl
    15.5 µl   H2O                               170.5 µl
    2.0 µl    10 U/µl TAP (Epicenter)           22 µl


20) Add 30 µl of above mixture to each 70 µl CIAP-treated sample for
a total volume of 100 µl.

21) Incubate at 37°C for 45 minutes.

22) Repeat Phenol/Chloroform extraction and precipitation as above in
steps 6-9 and then 11-15 (do not resuspend pellet).
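
The "x 11" column of the cocktail tables in this example can be reproduced with a short calculation. The sketch scales the per-reaction TAP Cocktail volumes to a master mix; reading "x 11" as ten reactions plus one extra reaction's worth of overage is an assumption, as are the function and variable names.

```python
# Sketch only: "10 reactions + 1 extra" is an assumed reading of the x 11 column.

TAP_COCKTAIL_UL = {
    "10x TAP buffer": 10.0,
    "RNasin": 2.5,
    "H2O": 15.5,
    "10 U/ul TAP": 2.0,
}


def master_mix(per_reaction_ul, n_reactions, extra_reactions=1):
    """Scale per-reaction volumes (in µl) to a master mix with overage."""
    if n_reactions < 1 or extra_reactions < 0:
        raise ValueError("need at least one reaction and non-negative overage")
    factor = n_reactions + extra_reactions
    return {name: volume * factor for name, volume in per_reaction_ul.items()}


if __name__ == "__main__":
    # Per-reaction total should equal the 30 µl added in step 20.
    print("per reaction:", sum(TAP_COCKTAIL_UL.values()), "ul")
    for name, volume in master_mix(TAP_COCKTAIL_UL, n_reactions=10).items():
        print(f"{name}: {volume} ul")
```

The same check (component volumes summing to the volume added per sample) can be applied to the other cocktails in Examples 1-6.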

Example 2. 5' cDNA Synthesis: DNA-RNA Hybrid Primer Ligation 

1) Transfer samples from Example 1 to 8-strip tubes. 

2) Resuspend pellet in Ligation Cocktail (see below). 


Ligation Cocktail:

For each reaction:                                      x 11
    3 µl      10 mM ATP                                 33 µl
    1 µl      RNasin                                    11 µl
    4.5 µl    H2O                                       49.5 µl
    2 µl      R-BAP-TAP DNA/RNA hybrid oligomer         22 µl

3) Add 10.5 µl of above mixture to each pellet; dissolve pellet completely at
room temperature by (preferably) tapping the tube or vortexing if needed.

4) Make an enzyme mix as follows:

Enzyme Mixture:

For each reaction:                                      x 11
    30 µl     H2O                                       330 µl
    12 µl     5x DNA Ligase Buffer (Life Tech.)         132 µl
    1.5 µl    RNasin                                    16.5 µl
    6 µl      T4 RNA Ligase (Life Tech.)                66 µl

Total reaction volume 60 µl

5) Incubate overnight at 20°C.

6) Repeat Phenol/Chloroform and precipitation as above in CIP/TAP Cocktail
protocol steps 6-9 and 11-15 (do not resuspend pellet).

Example 3. 5' cDNA Synthesis: cDNA First-Strand Synthesis

1) Resuspend cDNA pellet in Random Hexamer Cocktail (see below).

Random Hexamer Cocktail:

For each reaction:                                               x 11
    10 µl     H2O                                                110 µl
    0.5 µl    random hexamer (dN6-5'-Phosphate, 100 µM)          5.5 µl
    5 µl      Oligo-(dT) (dT30VN-5'-Phosphate, 100 µM)           55 µl

2) Add 15.5 µl of above mixture to each tube and resuspend pellet.

3) Heat at 70°C for 10 minutes and quick-cool on ice.

4) Make First-Strand Synthesis Cocktail as follows (see below).

First-Strand Synthesis Cocktail:

For each reaction:                              x 11
    6 µl      5x First-Strand Buffer            66 µl
    3 µl      10 mM dNTPs                       33 µl
    3 µl      100 mM DTT                        33 µl
    1 µl      RNase Inhibitor                   11 µl

5) Add 13 µl of the above mixture to each 15.5 µl sample to give a total volume
of 28.5 µl.

6) Incubate at 37°C for 2 minutes.

7) Add 1.5 µl Superscript II RT to each reaction for a total volume of 30 µl.

8) Incubate at 37°C for 10 minutes.

9) Incubate at 42°C for 1 hour.

10) Incubate at 16°C.

11) Add 40 µl of the following DNA Ligase Mixture (see below) to each reaction
tube for a total volume of 70 µl.

E. coli DNA Ligase Mixture:

For each reaction:                              x 11
    4 µl      10x E. coli Ligase Buffer         44 µl
    33 µl     H2O                               330 µl
    3 µl      E. coli DNA Ligase (10 U/µl)      33 µl

12) Continue incubation at 16°C for 2 hours.

Example 4. 5' cDNA Synthesis: Removal of Non-ligated Primers

While the 2 hour incubation described in Example 3 is progressing, prepare
one Boehringer-Mannheim Quick-Spin G-50 column per reaction as follows:

1) Mix the resin bed well by inverting the columns repeatedly.

2) Remove the top cap first, and then the bottom cap. This avoids bubble formation
and resultant poor performance of the spin-column.

3) Stand column vertically and allow to drain completely.

4) Add 0.75 ml of 10 mM Tris (pH 7.5) to the top of the bed without disturbing.
If the bed becomes disturbed, pipette the solution up and down slowly to mix
the bed uniformly and allow the bed to re-settle so as to form a uniform surface.

5) Stand column vertically and allow to drain completely.

6) Place the columns into a 15 ml conical centrifuge tube with the vendor's
associated collector tube beneath the spin-column to collect the sample.

7) Centrifuge spin-column at 1000-1200 x g for 2 minutes.

8) Remove spin-column with forceps; remove the tube containing the flow-through
and discard it.

9) Carefully load the sample to the top center of the spin-column.

10) Wash the sample tube with 20 µl H2O and load on the same column.

11) Place a new collection tube beneath each spin-column and centrifuge at
1000-1200 x g for 4 minutes.

12) Remove spin-columns and collect the flow-through into new, labeled tubes.

13) Total sample volume will be approximately 105 µl.

Example 5. 5' cDNA Synthesis: RNase (H, A, and T1) Treatment

1) To each reaction described in Example 4 add Second-Strand Reaction Buffer (see
below).

Second-Strand Reaction Buffer:

For each reaction:                              x 11
    3 µl      100 mM DTT                        33 µl
    6 µl      First-Strand Buffer               33 µl
    30 µl     Second-Strand Buffer              330 µl
    6 µl      H2O                               66 µl

2) Add 45 µl of the above mixture to each 105 µl sample to give a total volume of
150 µl.

3) Add 2 µl of RNase H to each sample.

4) Incubate at 37°C for 30 minutes to nick the RNA in RNA/DNA hybrids.

5) Make an RNase Mixture comprising: 22 µl RNase H, 44 µl RNase Cocktail
(Ambion; available as an RNase A and RNase T1 mixture).

6) Heat samples to 95°C for 2 minutes.

7) Slow cool down to 37°C and continue incubation.

8) Add 3 µl RNase Mixture to each of the cDNAs, mix by pipetting up and down.

9) Continue incubation at 37°C for an additional 10 minutes.

10) Heat samples to 95°C for 2 minutes.

11) Slow cool down to 37°C and continue incubation.

12) Add an additional 3 µl of RNase Mixture to each of the cDNAs, mix by
pipetting up and down.

13) Continue incubation at 37°C for an additional 15 minutes.

14) Repeat Phenol/Chloroform extraction and precipitation as above in
steps 6-9 and then 11-15.

15) Dissolve pellet in 20 µl H2O.

16) Remove a 5 µl aliquot for Second-Strand synthesis (see below) for producing
5'-cDNA for SeqCalling™ Chemistry Protocol.

Example 6. Second-Strand Synthesis for Producing 5'-cDNA for SeqCalling™ Chemistry

1) Generate PCR Mixture (see below) as follows:

PCR Mixture:

For each reaction:                              x 11
    5 µl      10x PCR Buffer                    55 µl
    1 µl      10 mM dNTPs                       5.5 µl
    1 µl      10 µM R17 Primer                  5.5 µl
    37.5 µl   H2O                               412.5 µl
    0.5 µl    Advantage Polymerase              5.5 µl

2) Add 45 µl of the above mixture to each 5 µl sample, for a total volume of 50 µl.

3) Heat samples as per protocol below, making sure that the sample tubes are placed
in the thermocycler only after it has reached >80°C.

    94°C for 2 minutes   |
    55°C for 2 minutes   |  x 1 cycle ONLY
    72°C for 60 minutes  |  (Cycle designated KM-AD-2N)
    4°C for long-term storage

4) Warm reaction tubes to 37°C.

5) Make SAP Cocktail (see below) as follows:

SAP Cocktail:

For each reaction:                                      x 11
    12 µl     10x SAP Buffer                            132 µl
    5 µl      H2O                                       55 µl
    3 µl      Shrimp Alkaline Phosphatase (SAP; 1 U/µl) 33 µl

6) Add 20 µl of SAP Cocktail to each reaction.

7) Heat to 37°C for 30 minutes.

8) Purify samples by Qiagen 96-well plate as per the manufacturer's protocol.

9) Elute cDNAs in 100 µl 10 mM Tris-HCl buffer and proceed with fluorometry.

Example 7. Clone Sizing

SeqCalling™ Chemistry products generated in any of Examples 1-6 are diluted and
re-amplified. Fractionation is then performed by electrophoresing the re-amplified sample on an
agarose gel using MetaPhor agarose (FMC). After the electrophoresis, the gel is physically cut
into a total of 48 fractions. 24 of the fractions are derived from a 4% MetaPhor gel, and
correspond to the lower molecular weight fractions, whereas the other 24 fractions, derived from
a 3% MetaPhor gel, correspond to the upper molecular weight fractions.

Following the elution of the DNA from the gel fractions, the DNA fragments are ligated
into the TOPO-TA cloning vector (Invitrogen). These plasmids are then
transformed into E. coli. The transformed bacterial cells are plated onto petri dishes and grown
to a size that allows automated colony picking. A suitable number of colonies/fraction are
selected so as to ensure a statistically accurate representation of the DNA fragments contained
within the fraction (i.e., suitable numbers of picked colonies/fraction are 48 or 96). Following
the incubation of the selected clones, the fragment contained within each individual clone is
sized using the proprietary MegaBACE system, or an equivalent. Sizing is performed with
multiple clones/lane. This multiplexing allows sizing to be performed in a cost and time efficient
manner. The multiplexing is performed with a liquid handling robot (e.g., Matrix PlateMate).
After running the multiplexed fragments on MegaBACE, and correlating the size of the fragment
with the E. coli clone containing the insert, the fragments are analyzed to determine suitability
for sequencing.


Example 8. Comparison of clone complexity with and without use of a sizing step

The effect of using a clone sizing step on the complexity, i.e., the representation of rare
transcripts, of the resulting clones is shown in FIGS. 4A and 4B. In FIG. 4A, no sizing step was
used, while clone sizing was used in the identification of the clones shown in FIG. 4B. Shown in
the figures is a comparison of the frequencies (expressed in percentage) of clones derived from
transcripts present at varying levels. The outer numbers represent the prevalence of a particular
clone sequenced, and the inner numbers represent the percentages of the total number of clones
sequenced that fall into this abundance class. As illustrated in FIG. 4A, the sequencing results
that were obtained without the use of the sizing filter demonstrated that only a small percentage
of the total number of fragments that were sequenced were low copy number fragments
(i.e., singletons, duplicates, and triplicates). Specifically, singletons were found to comprise only
2% of the total number of fragments sequenced, while fragments that were present at greater than
51 copies comprised 38% of the total fragments sequenced. In contrast, as illustrated in FIG. 4B,
the sequencing results that were obtained with the use of the sizing filter were enriched for clones
from low abundance transcripts (i.e., singletons, duplicates, and triplicates). These clones
constituted approximately 33% of the total fragments sequenced. In contrast, without the use of
this sizing filter, these fragments were found to comprise a total of only 8% of the sequencing
results.
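
The abundance-class percentages discussed in this example can be tallied from a list of sequenced clones with a few lines of code. The sketch below counts how many clones share each sequence and reports the percentage of all clones falling into the singleton, duplicate, triplicate, and higher-copy classes. The input identifiers are made-up toy data, not results from FIGS. 4A or 4B.

```python
# Sketch only: the input list is toy data used to show the tally.
from collections import Counter

_LABELS = {1: "singleton", 2: "duplicate", 3: "triplicate"}


def abundance_classes(sequence_ids):
    """Percentage of sequenced clones in each copy-number class.

    sequence_ids -- one identifier per sequenced clone; clones carrying the
                    same fragment share an identifier.
    """
    total = len(sequence_ids)
    if total == 0:
        return {}
    copies = Counter(sequence_ids)       # fragment -> number of clones
    clones_per_class = Counter()
    for n_copies in copies.values():
        label = _LABELS.get(n_copies, ">3 copies")
        clones_per_class[label] += n_copies  # count clones, not fragments
    return {label: 100 * n / total for label, n in clones_per_class.items()}


if __name__ == "__main__":
    clones = ["a"] * 5 + ["b", "c", "e"] + ["d", "d"]
    print(abundance_classes(clones))
```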


Equivalents 

Although particular embodiments have been disclosed herein in detail, this has been done 
by way of example for purposes of illustration only, and is not intended to be limiting with 
respect to the scope of the appended claims that follow. In particular, it is contemplated by the
inventor that various substitutions, alterations, and modifications may be made to the invention
without departing from the spirit and scope of the invention as defined by the claims. For
example, the selection of the specific tissue(s) or cell line(s) that is to be utilized in the practice 
of the present invention is believed to be a matter of routine for a person of ordinary skill in the 
art with knowledge of the embodiments described herein. 


WO 00/40757 PCT/US00/00402

WHAT IS CLAIMED IS:

1. A method of screening a population of nucleic acids for a novel sequence, the method
comprising:

providing a population of nucleic acid sequences;

partitioning said population into one or more subpopulations of nucleic acids;

identifying a first nucleic acid sequence in the subpopulation of nucleic acid sequences;
and

comparing the first nucleic acid sequence to a reference nucleic acid sequence or
sequences, wherein the absence of the first nucleic acid sequence in the reference nucleic acid or
nucleic acid sequences indicates the first nucleic acid is a novel nucleic acid sequence.

2. The method of claim 1, wherein said DNA population is a cDNA population derived
from a population of RNA molecules.

3. The method of claim 2, further comprising partitioning the RNA molecules.

4. The method of claim 2, wherein said cDNA population is derived from the 5' ends of the
RNA molecules.

5. The method of claim 2, wherein said cDNA population is derived from the interior
regions of the RNA molecules.

6. The method of claim 2, wherein said cDNA population is derived from the 3' ends of the
DNA molecules.


7. The method of claim 2, wherein said partitioning step comprises hybridization of a probe 
nucleic acid sequence to the population of nucleic acids. 

8. The method of claim 2, wherein said partitioning step comprises digesting the cDNA
molecules with one or more restriction enzymes.

9. The method of claim 8, further comprising ligating adapter oligonucleotides to the
termini of the digested cDNA molecules.

10. The method of claim 9, further comprising amplifying the ligation products.

11. The method of claim 8, further comprising separating the amplified products.

12. The method of claim 11, wherein said separating is by gel electrophoresis.

13. The method of claim 11, wherein the first nucleic acid sequence is identified by
comparing the size of one or more digestion products produced by a member of the
subpopulation of nucleic acids to the sizes of fragments generated by the same restriction enzyme
or enzymes in said reference nucleic acid or nucleic acids.

14. The method of claim 11, further comprising

recovering one or more size-separated digestion products;
reamplifying the recovered products; and
separating the reamplified products.

15. The method of claim 14, wherein said separating is by gel electrophoresis.

16. The method of claim 15, wherein the first nucleic acid sequence is identified by
comparing the size of one or more digestion products produced by a member of the
subpopulation of nucleic acids to the sizes of fragments generated by the same restriction enzyme
or enzymes in said reference nucleic acid or nucleic acids.

17. The method of claim 9, further comprising: 

inserting the ligated adapter oligonucleotide into a cloning vector to form a vector-insert; 
transforming the vector-insert into a suitable host; 

culturing transformed host under conditions allowing for replication of the vector-insert;
recovering the vector-insert from said host; and

digesting the vector-insert with one or more restriction enzymes, thereby releasing said 
insert; and 

comparing the size of the insert to sizes of fragments generated by the same restriction 
enzyme or enzymes in said reference nucleic acid or nucleic acids. 

18. The method of claim 1, wherein comparing is by determining at least a portion of the
nucleotide sequence of the first nucleic acid sequence and comparing the nucleotide sequence to
the nucleotide sequence of one or more reference nucleic acids.

19. The method of claim 1, wherein comparing is by hybridizing the first nucleic acid
sequence to one or more of the reference nucleic acid sequences.

20. A method for equalizing the representation of nucleic acids in a population of nucleic 
acids, the method comprising: 

providing a population of nucleic acid sequences, wherein said population comprises a 
first nucleic acid and a second nucleic acid having a nucleic acid sequence distinct from the first 
nucleic acid, and wherein said first nucleic acid is present at a higher level in said population 
than said second population; 

partitioning said population into one or more subpopulations of nucleic acids; and 

comparing the levels of said first nucleic acid sequence to the levels of said second 
nucleic acid sequence in the subpopulation of nucleic acid sequences, wherein a lower level of 
the first nucleic acid sequence relative to the second nucleic acid sequence indicates the 
representation of said first and second nucleic acid sequences are normalized. 

21. A method for producing a population of nucleic acid molecules enriched for 5' regions of 
mRNA molecules, the method comprising: 

providing a population of RNA molecules, said population including RNA molecules 
having a 5' terminal Gppp cap structure and a 5' terminal phosphate group; 
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contacting said population of RNA molecules with a phosphatase under conditions that 
result in removal of the 5' terminal phosphate group while leaving the 5' terminal Gppp cap 
structure intact; 

inactivating said phosphatase; 

contacting the population of RNA molecules with a pyrophosphatase under conditions 
that result in the removal of the 5' terminal Gppp and the formation of a 5' phosphate group; 

annealing an oligonucleotide in the presence of an RNA ligase to form a hybrid molecule; and 

forming a cDNA from said oligonucleotide. 

22. A method of identifying an RNA sequence in a sample comprising a plurality of RNA 
sequences, the method comprising: 

synthesizing cDNA copies of a plurality of RNA species to form a cDNA sample; 

determining the size of one or more of said cDNA molecules in said cDNA sample; 

comparing the size of said sample with the size of a reference nucleic acid; and 

thereby identifying the cDNA sequence. 

23. The method of claim 22, wherein said cDNA molecules are digested with one or more 
restriction enzymes prior to the determining step. 

24. The method of claim 23, further comprising ligating adapter oligonucleotides to the 
termini of the digested cDNA molecules prior to the determining step. 

25. The method of claim 22, wherein said identifying step comprises comparing the size of 
one or more digestion products produced by one or more said cDNA molecules to a reference 
nucleic acid or nucleic acids. 

26. A method of identifying an RNA sequence in a population of RNA sequences, the 
method comprising: 

(a) removing 5' terminal pppG from RNAs in said population to form a population of 
RNAs having terminal 5' phosphate groups; 

(b) ligating a linker oligonucleotide to the terminal 5' phosphate groups of RNA 
molecules in said population of RNAs; 

(c) synthesizing complementary cDNA molecules from said population of RNA 
molecules to form a cDNA sample; 

(d) digesting said complementary cDNA molecules with at least one restriction enzyme; 

(e) ligating an adapter molecule to the digested cDNA molecules; 

(f) amplifying the molecules produced in step (e); 

(g) identifying the amplified molecules of step (f); and 

(h) comparing the amplified molecules to one or more reference nucleic acids. 

METHOD OF IDENTIFYING NUCLEIC ACIDS 


Disclosed are methods for identifying nucleic acids in a sample of nucleic acids in which 
nucleic acids are initially present in unequal amounts. The methods include partitioning the 
starting population of nucleic acids to form one or more subpopulations, and then identifying 
nucleic acids that are present in different amounts in the partitioned nucleic acid sample as 
compared to the starting population. 

SEQUENCE LISTING 

<110> Curagen Corporation 
 Rothberg et al. 

<120> METHOD OF IDENTIFYING NUCLEIC ACIDS 

<130> 15966-539-061 

<140> Not Yet Assigned 
<141> 2000-01-07 

<150> 60/115,109 
<151> 1999-01-08 

<150> 09/417,386 
<151> 1999-10-13 

<160> 14 

<170> PatentIn Ver. 2.0 

<210> 1 

<211> 18 

<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 

<400> 1 

ctctccgatg caggtggc 18 

<210> 2 
<211> 46 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 2 

agcacactcc agcctctctc cgagcacatg cgacactgag tactac 46 

<210> 3 

<211> 46 

<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 3 

agcacactcc agcctctctc cgagcacatg cgacactgag tactaa 46 

<210> 4 
<211> 47 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 4 

agcacactcc agcctctctc cgaaccgacg tcgaatatcc atgcagc 47 

<210> 5 
<211> 47 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 5 

agcacactcc agcctctctc cgaaccgacg tcgaatatcc atgcaga 47 

<210> 6 

<211> 23 

<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 

<400> 6 

accgacgtcg aatatccatg cag 23 

<210> 7 

<211> 23 

<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 7 

agcacactcc agcctctctc cga 23 

<210> 8 
<211> 17 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 8 

agcacactcc agcctct 17 

<210> 9 
<211> 24 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 9 

agcacactcc agcctctctc cgaa 24 

<210> 10 
<211> 24 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 10 

agcacactcc agcctctctc cgac 24 

<210> 11 
<211> 24 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 11 

accgacgtcg aatatccatg caga 24 


<210> 12 
<211> 24 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 12 

accgacgtcg aatatccatg cagc 24 

<210> 13 
<211> 23 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 13 

agcacactcc agcctctctc cga 23 

<210> 14 
<211> 43 
<212> DNA 

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 14 

agcacactcc agcctctctc cgattttttt tttttttttt ttt 43 

Method of Identifying Nucleic Acids 


Related Applications 

This application claims priority to USSN 60/115,109, filed January 8, 1999, which is 
incorporated herein in its entirety. 

Field of the Invention 

The present invention relates to nucleic acids and more particularly to methods of 
equalizing the representation of nucleic acids in a population of nucleic acid molecules. 

Background of the Invention 

Approximately 10,000-20,000 genes are thought to be expressed within living cells, 
depending upon the specific cell type. RNAs corresponding to different genes can be present in 
different levels in cells. For example, transcripts from as few as 10-15 genes may represent 
10-15% of cellular mRNA by mass. In addition to these highly abundant transcripts, another 
1000-2000 genes encode moderately abundant transcripts, which can account for up to 50% of cellular 
mRNA mass. Transcripts from the remaining genes fall into the low abundance class. 

Because many genes are identified by isolating complementary DNA (cDNA) 
corresponding to an RNA sequence, a significant problem can arise because of differences in the 
levels at which specific RNAs are present in cell types. The most abundant sequences can be 
repeatedly sampled, while the lowest abundance class may be rarely, if ever, sampled. 

Several normalization and subtractive hybridization protocols have been developed to 
help overcome this problem. These techniques can be technically difficult to perform, and they 
can fail to detect cDNAs corresponding to rare transcripts. 

Summary of the Invention 

The invention is based in part on the discovery of novel procedures for equalizing, or 
normalizing, the representation of nucleic acids in a sample of nucleic acids in which different 
nucleic acids are initially present in the sample in unequal amounts. 

Accordingly, in one aspect the invention provides a method of screening a population of 
nucleic acid sequences. The method includes providing a population of nucleic acid sequences, 
partitioning the population into one or more subpopulations of nucleic acids, and identifying a 
first nucleic acid sequence having an increased level in the subpopulation relative to its level in 
the starting population of nucleic acids. The first nucleic acid is then compared to a reference 
nucleic acid sequence or sequences. The absence of the first nucleic acid sequence in the 
reference nucleic acid or nucleic acid sequences indicates the first nucleic acid is a novel nucleic 
acid sequence. 

The RNA can be derived from a plant, a single-celled animal, a multi-cellular animal, a 
bacterium, a virus, a fungus, or a yeast. If desired, the RNA can also be partitioned prior to 
synthesizing cDNA. 

Among the advantages of the methods are that they eliminate, or minimize, redundant 
identification and characterization of identical nucleic acid sequences in a population of nucleic 
acids. 

In some embodiments, the cDNA is synthesized to selectively generate cDNA species 
that are enriched for those sequences oriented towards the 5'-terminus of the cDNA. In other 
embodiments, the cDNA is synthesized to enrich for those sequences oriented towards the 
3'-terminus of the cDNA. 

In some embodiments, the population is normalized by digesting the cDNAs with one or 
more restriction endonucleases, in different reaction vessels, so as to generate multiple 
segregated partitions. Preferably, each specific digested cDNA fragment will occur in only one 
partition. 

In some embodiments, the cDNAs are partitioned by physical methods, which may 
optionally follow the restriction endonuclease digestion. The physical methods separate the 
cDNAs as a function of their terminal nucleotide sequences, overall length and migratory pattern on 
a sizing matrix that possesses the ability to separate molecules as a function of their physical 
and/or biochemical properties. 

In other embodiments, the cDNAs are partitioned during subsequent PCR-based 
amplification of adapter-ligated cDNA fragments that have been digested with one or more 
restriction endonucleases. 

In other embodiments, the cDNAs are partitioned by screening the original mixture of 
cDNAs so as to remove those sequences that have already been characterized. Screening occurs 
using partitioned subtraction, whereby the original cDNAs are brought into contact with a 
prepared subtraction library of known sequence in such a way that any sequence contained 
within the original library that is complementary to any element of the subtraction library is 
removed or suppressed. 

cDNA sequences may also be partitioned by determining the size of each cDNA fragment 
prior to sequencing, or by biasing for formation of larger fragment PCR products by lariat formation. 
In this method, a bias for the larger fragment within the PCR reaction is introduced to allow 
efficient preferential amplification of longer fragments. Alternatively, partitioning may occur by 
preferentially amplifying 5' terminal or 3' terminal sequences of mRNA molecules. 

If desired, the amplified cDNAs may be fractionated by separating the amplified cDNAs on a 
sizing matrix that separates molecules as a function of their physical and/or biochemical 
properties and excising individual cDNA fragments from said sizing matrix. The excised cDNA 
fragments are then inserted into a recombinant vector, or further amplified. 

In some embodiments, the restriction endonuclease is a restriction endonuclease that 
possesses a recognition sequence 4 to 8 basepairs in length and produces either a 5'- or 3'-terminal 
overhang 0 to 6 basepairs in length. 

In some embodiments, the identified sequence is subjected to computational analysis. 
The computational analysis can include querying, or searching, a nucleotide sequence database to 
identify sequences that match, or the absence of any sequences that match. The database 
includes a plurality of known nucleotide sequences of nucleic acids that may be present in the 
sample. 

Preferably, the nucleic acid database comprises substantially all the known, expressed 
nucleic acid sequences derived from a group comprising a plant, a single-celled animal, a 
multi-cellular animal, a bacterium, a virus, a fungus, or a yeast. 

In some embodiments, sizing includes diluting and re-amplifying the cDNAs, 
fractionating the re-amplified cDNAs by use of one or more sizing matrixes that separate the 
molecules as a function of their physical and/or biochemical characteristics, and physically dividing 
or cutting the sizing matrixes into a plurality of sections, wherein each section is comprised of 
one or more cDNAs of similar molecular weight or size. The cDNAs are eluted from each of the 
sizing matrix sections, ligated into a cloning vector and transformed into a host, e.g., a bacterial 
host. A plurality of the transformed host colonies are selected so as to ensure a statistically 
accurate representation of the cDNAs originally contained within the sizing matrix sections. The 
inserts from this plurality of colonies are recovered and their molecular weights or sizes are 
determined. A plurality of insert DNAs is then selected, wherein each successive insert has a molecular weight 
or size that is within a 0.2 basepair window, and only those DNA species that fall within 
the 0.2 basepair window are subsequently subjected to nucleotide sequencing. 

As utilized herein, the term "normalized" is defined as a mixture of mRNAs (or cDNAs 
thereof) in which the copy number of highly abundant mRNA species is reduced relative to its 
copy number in a starting population of nucleic acids, and the copy number of a less abundant 
mRNA species has been enriched relative to the copy number of the latter mRNA in the starting 
population. 
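
By way of illustration only, the definition above can be expressed as a simple check on copy numbers. The following is a minimal sketch, not part of the disclosed methods: it assumes that "copy number" is compared as a fraction of the total population, and the function names and the copy numbers are hypothetical.

```python
def relative_abundance(counts):
    """Convert raw copy numbers into fractions of the whole population."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def is_normalized(start_counts, sub_counts, abundant, rare):
    """True if the abundant species is reduced and the rare species enriched,
    each measured as a fraction of its own population (an assumption)."""
    start = relative_abundance(start_counts)
    sub = relative_abundance(sub_counts)
    return (sub.get(abundant, 0.0) < start[abundant]
            and sub.get(rare, 0.0) > start[rare])


# Hypothetical copy numbers, for illustration only.
starting = {"abundant_mRNA": 9000, "moderate_mRNA": 990, "rare_mRNA": 10}
partition = {"abundant_mRNA": 50, "moderate_mRNA": 40, "rare_mRNA": 10}
print(is_normalized(starting, partition, "abundant_mRNA", "rare_mRNA"))
```
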

Among the advantages provided by the present invention is that multiple partitioning 
strategies function in a synergistic manner so as to ameliorate unnecessary, redundant sequencing 
of the same sequence(s), while concomitantly enhancing the sequencing of rarer sequences. 

The partition strategies disclosed herein also normalize cDNA abundance by separating 
the cDNA sequences into multiple partitions possessing minimal sequence overlap. In addition, 
the various partitioning strategies are performed so as to assure that substantially all cDNAs are 
sampled. An additional normalization effect may be obtained by separating the resulting DNA 
fragments based upon their overall size (i.e., size fractionation). Moreover, it is also possible to 
normalize the abundance of the cDNAs to an even greater degree by the use of one of several 
disclosed pre-characterization methods. 

All technical and scientific terms used herein have the same meanings commonly 
understood by one of ordinary skill in the art to which this invention belongs. Although any 
methods and materials similar or equivalent to those described herein can be used in the practice 
of the present invention, the preferred methods and materials are now described. The citation or 
identification of any reference within this application shall not be construed as an admission that 
such reference is available as prior art to the present invention. All publications mentioned 
herein are incorporated herein in their entirety by reference. 

Brief Description of the Drawings 

FIG. 1 is a flow diagram illustrating a method for normalizing the abundance of nucleic 
acid molecules in a population of nucleic acid molecules. 

FIG. 2 is a flow diagram illustrating a method of 5'-enriched cDNA synthesis according 
to the invention. 

FIG. 3A is a schematic diagram showing restriction enzyme digestion and adapter 
ligation for enrichment of 5' ends of mRNA molecules. 

FIG. 3B is a histogram showing the regions of genes covered by clones constructed using 
5' end enrichment. 

FIG. 3C is a schematic diagram showing restriction enzyme digestion and adapter 
ligation for enrichment of mRNA molecules containing internal restriction fragments. 

FIG. 3D is a histogram showing the regions of genes covered by clones constructed 
using enrichment for internal restriction fragments. 

FIGS. 4A and 4B are schematic illustrations showing the effects of partitioning on the 
types of nucleic acids recovered in relation to the abundance of the mRNA molecules. 

Detailed Description of the Invention 

The present invention provides methods for identifying nucleic acids in a population of 
nucleic acid samples. It is based in part on normalizing the representation of sequences that may 
be initially present in different levels in the population of nucleic acid sequences. The 
normalization takes place by one or more methods of partitioning the nucleic acid population. 

A schematized overview of the invention is shown in FIG. 1. At the input step 100, a 
starting population of RNA is chosen for analysis. Unless indicated otherwise, reference to a 
given RNA or population of RNAs is understood to also encompass reference to the 
corresponding cDNA or cDNAs. 

Any population of RNA molecules can be used as long as the population contains, or is 
suspected of containing, two or more distinct RNA molecules. The population can be isolated 
from a starting sample using standard methods for isolating RNA. The RNA population can be 
isolated from, e.g., an entire organism or multiple organisms, or from a tissue or cell of an 
organism. The RNA can also be isolated from, e.g., cultured cells, such as eukaryotic or 
prokaryotic cells grown in vitro. If desired, the RNA can be mRNA (e.g., polyA+ RNA), or 
stable RNAs (e.g., ribosomal RNA, transfer RNA, or small nuclear RNA). The input RNA or 
cDNA can be a subpopulation containing the 5' end of RNA molecules (110), a subpopulation 
having internal regions of starting RNA molecules (112), or a subpopulation containing the 3' 
end of the cDNA molecules (114). 

The selected population or subpopulation is next subjected to a normalization analysis 
(200). The normalization analysis includes one or more partitioning steps that decrease the 
relative amount of sequences that are abundant in the starting population of nucleic acids and 
increase the relative representation of sequences that are rare in the starting population of nucleic 
acids. A partitioning step can take place before or after mRNA is converted to cDNA. A 
partitioning step can also take place following amplification of a cDNA. Unless stated 
otherwise, any partitioning method described herein can be used in conjunction with one or more 
additional partitioning methods. Examples of suitable partitioning steps are provided below. 

In some embodiments, cDNA molecules are subjected to digestion with restriction 
enzymes, after which adapter oligonucleotides are ligated to the digestion products, and the 
resulting products amplified. FIG. 1 indicates two types of digestions and adapter ligations 
which can be performed. The first, designated short chemistry (216) because it tends to result in 
shorter amplification products, uses two restriction enzymes, followed by ligation of adapter 
oligonucleotides having termini complementary to the termini of the internal digestion 
fragments. The second, designated long chemistry (218), similarly uses restriction digestion and 
adapter ligation but uses longer adapters, which generally result in longer amplification products. 

FIG. 1 also illustrates that the modified cDNAs can be subjected to size fractionation 
(220), which is an example of a partitioning method, and that information from the size fraction 
analysis can be used in a precharacterization analysis (222). A precharacterization can include, 
e.g., comparing the size of the insert to sequence databases of fragment sizes produced by the 
restriction enzyme. Amplification of short and long chemistry fragments can also be performed 
in association with partitioning steps, which are explained in detail below. 
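
The size-based precharacterization described above can be sketched computationally as follows. This is an illustrative example under stated assumptions, not the implementation used in the invention: the function names, the reference sequences, and the recognition site are hypothetical, and the cut is placed at the first base of each recognition site for simplicity (a real enzyme cuts at a defined offset within or near its site).

```python
import re


def digest_sizes(sequence, site):
    """Return fragment lengths after cutting at the start of every site.

    Simplifying assumption: the cut falls at the first base of the site.
    A lookahead is used so that overlapping sites are also found.
    """
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def matching_references(insert_size, references, site, tolerance=0):
    """Names of reference sequences that yield a fragment of the insert's size."""
    hits = []
    for name, seq in references.items():
        sizes = digest_sizes(seq, site)
        if any(abs(size - insert_size) <= tolerance for size in sizes):
            hits.append(name)
    return hits


# Hypothetical reference sequences and recognition site, for illustration only.
references = {
    "geneA": "AAAAGATCTCCCCCCAGATCTGG",
    "geneB": "TTTTTTTTAGATCTTTTT",
}
print(matching_references(12, references, "AGATCT"))
```

An insert whose size matches no predicted fragment would be a candidate for sequencing, while a match suggests that the insert may correspond to an already known sequence.
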

The amplified products are next sequenced (300). Sequencing can be performed by any 
method known in the art. The compiled sequence data are then assembled (400), and the 
sequence generated is compared to known sequences, e.g., sequences in publicly available 
databases. 

The methods herein described are therefore useful for identifying genes, e.g., expressed 
genes in an organism of interest, e.g., a human. The sequence information obtained is 
particularly useful for identifying genes transcribed at low levels, or generating low levels of 
steady state transcripts. The methods can also be used, e.g., to identify secreted proteins for 
potential therapeutic use and/or for drug targets; identify variations within the human genome, 
such as single nucleotide polymorphisms (SNPs); identify differences between normal and 
diseased tissue; and analyze differential gene expression in different tissues and/or species. 

WO 00/40757 PCT/US00/00402 

Partitioning prior to cDNA synthesis 

One approach to normalize levels of mRNA from a given sample, e.g. a given cell or 
tissue type, is to arbitrarily separate a starting population of RNA molecules into many smaller 
subpopulations, or collections. In general, a greater number of partitions increases the likelihood 
that a given partition will lack a sequence or sequences that are abundant in the starting 
population of nucleic acid sequences. This method therefore allows for access to sequences that 
are expressed in very low copy number. 

Alternatively, RNA populations can be isolated from different cell types. This 
partitioning strategy is based on the premise that different tissues tend to express different 
subsets of genes. Thus, RNA sequences can be partitioned by sequencing multiple different 
cDNA libraries extracted from one or more tissues within the body. However, the partitioning 
will not typically be complete, because many genes are expressed in more than one tissue type. 

Synthesis and Amplification of cDNA molecules 

Typically, partitioning is performed on cDNA populations that have been modified for 
subsequent analysis. The modifications may include: (i) digesting the cDNA with at least one 
restriction endonuclease; (ii) ligating an adapter oligonucleotide to one or more 
termini of the digestion products; and (iii) amplifying the ligated products, e.g., in PCR-mediated 
amplification. These methods are particularly suited to cDNA molecules that have been 
constructed from the 5', internal, and 3' subpopulations of RNA molecules as described above. 
These manipulations are collectively known as SeqCalling™ chemistry. In preferred 
embodiments, cDNA is generated from populations of RNA molecules that have been divided 
into subpopulations containing 5' ends of transcripts, populations of molecules containing 
internal regions of RNA molecules, or subpopulations containing 3' ends of RNA molecules. 

A. Construction and amplification of cDNA subpopulations enriched for the 
5' ends of mRNA molecules 

5'-enriched cDNA synthesis generates cDNA species that are enriched for those 
sequences oriented towards the 5'-terminus of the cDNA, and in which a specific oligonucleotide 
sequence is ligated to the 5'-terminus. Approaches for generating cDNAs specifically enriched in 
transcript 5' ends are often based on the synthesis of a homopolymeric (e.g., dG or dA) tail by the 
enzyme terminal deoxynucleotidyl transferase (TdT) subsequent to the synthesis of the first 
cDNA strand. Second strand synthesis is then primed by the use of a complementary 
homo-oligonucleotide primer sequence. See e.g., Frohman, et al., 1988. Proc. Natl. Acad. Sci. USA 85: 
8998-9002; Delort, et al., 1989. Nucl. Acids Res. 17: 6439-6448; Loh, et al., 1989. Science 243: 
217-220; Belyavsky, et al., 1989. Nucl. Acids Res. 17: 2919-2932; Ohara, et al., 1989. Proc. 
Natl. Acad. Sci. USA 86: 5673-5677. 

Alternatively, amplification can exploit the 5'-terminal cap structure present in eukaryotic 
mRNAs (see e.g., Furuichi & Miura, 1975. Nature 253: 374-375; Banerjee, 1980. Microbiol. 
Rev. 44: 175-205; Shatkin, 1985. Cell 40: 223-224). However, mRNA preparations generally 
include a mixture of both capped and non-capped mRNA species. The non-capped mRNAs are 
thought to be primarily the result of degradation within the cell or during the isolation procedure. 
An alternative approach to enrich for full-length mRNAs is to purify capped mRNA using 
affinity reagents. These reagents include naturally occurring proteins that bind the cap structure 
(see e.g., Edery, et al., 1995. Mol. Cell. Biol. 15: 3363-3371); anti-cap antibodies (see e.g., 
Bochnig, et al., 1987. Eur. J. Biochem. 68: 460-467); and chemical modification of the cap, 
followed by selection for the modified cap structure (see e.g., Carninci, et al., 1996. Genomics 
37: 327-336). In addition, 5'-oligo capping can also be used, in which specific oligonucleotide 
sequences are selectively added to 5'-capped mRNAs prior to first strand cDNA synthesis. 
Subsequent synthesis of the second strand is primed by an oligonucleotide that is 
complementary to the modified cap sequence. See e.g., Maruyama & Sugano, 1994. Gene 138: 
171-174; Suzuki, et al., 1997. Gene 200: 149-156; Fromont-Racine, et al., 1993. Nucl. Acids Res. 
21: 1683-1684; U.S. Patent No. 5,597,713. 

An alternative method for isolating RNA molecules containing a capped 5' end is shown 
in FIG. 2. FIG. 2 depicts a flow diagram for 5'-enriched cDNA synthesis using a full-length 
mRNA having a 5'-terminal cap sequence (Gppp) and a poly A+ tail. Also shown in FIG. 2 is 
truncated mRNA having a 5' terminal phosphate group. Typically, RNA preparations contain a 
mixture of full-length capped RNAs and truncated mRNAs. The truncated RNAs can arise, e.g., 
by intracellular degradation of the RNA or by degradation of the RNA during its isolation. 

In the first step in FIG. 2, the free 5'-terminal phosphate groups of the truncated or 
degraded mRNAs are removed by the action of a phosphatase, e.g., the bacterial alkaline 
phosphatase shown, or calf intestinal alkaline phosphatase. The phosphatase is then inactivated. 
In the second step, the 5' cap is removed from the full-length mRNA using a pyrophosphatase, 
e.g., the tobacco acid pyrophosphatase shown in FIG. 2. The resulting product is the decapped 
full-length RNA with a free 5'-terminal phosphate group. 

In the third step in FIG. 2, the phosphate group serves as a substrate for an RNA 
ligase-mediated reaction that attaches a specific DNA/RNA hybrid to the 5'-terminus of the full-length 
mRNAs. An RNA containing the ligated hybrid is used as a substrate for first and second strand 
cDNA synthesis. Preferably, a combination of oligo(dT)- and random hexamer-mediated first 
strand priming is performed in the presence of E. coli ligase to enhance overall cDNA length. 
Preferably, an RNase and thermal cycling are used to remove the RNA strand after first strand 
synthesis. The resulting single strand DNA (ssDNA) functions as a more effective reagent for 
the priming of second strand synthesis. 

Although first strand synthesis occurs for both types of mRNA species (i.e., full-length 
and truncated/degraded), only those mRNAs with the appropriate sequence ligated to the 
5'-terminus (i.e., full-length mRNAs) contain a priming site for subsequent second strand synthesis. 
Thus, RNAs derived from the full-length mRNAs are selectively amplified. 
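
The logic of the three-step selection of FIG. 2 can be summarized in a short sketch. This is a simplified model offered for illustration only: each RNA is reduced to the chemical state of its 5' end, the state labels and the function name are hypothetical, and every enzymatic step is assumed to go to completion.

```python
def accepts_ligated_oligo(five_prime_end):
    """Model the fate of one RNA through the three steps of FIG. 2.

    five_prime_end is "cap" for a full-length, Gppp-capped mRNA or
    "phosphate" for a truncated or degraded mRNA (hypothetical labels).
    """
    end = five_prime_end
    # Step 1: the phosphatase removes free 5' phosphates but leaves caps intact.
    if end == "phosphate":
        end = "hydroxyl"
    # The phosphatase is inactivated here, so phosphates created below survive.
    # Step 2: the pyrophosphatase removes the cap, exposing a 5' phosphate.
    if end == "cap":
        end = "phosphate"
    # Step 3: the RNA ligase can only join the oligonucleotide to a 5' phosphate.
    return end == "phosphate"


print(accepts_ligated_oligo("cap"))        # full-length mRNA
print(accepts_ligated_oligo("phosphate"))  # truncated mRNA
```

Only the RNA that started with a cap ends the procedure carrying the ligated sequence, and therefore the priming site for second strand synthesis.
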

Preferably, a thermostable enzyme for second strand synthesis in a non-thermal cycled 
temperature profile is used to ensure more stringent priming of the second strand reaction 
compared to a non-thermostable enzyme. 

A double-stranded cDNA prepared with an adapter containing an oligonucleotide 
sequence (nR plus "signature sequence") ligated to the 5'-terminus is digested with a restriction 
endonuclease as shown in FIG. 3A. The oligonucleotide RS [SEQ ID NO:1] (or nR) is used to 
prime the PCR amplification step subsequent to the ligation of the restriction digestion products. 
The nJ/nJ PCR product is shown as lined-through to denote that it does not clone efficiently in E. 
coli. 

A representation of the distribution of clones derived using 5' enriched synthesis with 
respect to the region of the gene they include is shown in FIG. 3B. A reference mRNA 
containing a 5' terminus, an ATG initiation codon, a Stop codon, and a 3' terminus is shown 
along the X-axis. Also shown is a histogram showing the number of clones (Y-axis) containing 
sequences derived from the indicated regions of the reference mRNA. The histogram reveals that 
the 5' enrichment method generates distributions enriched in 5' end fragments, and has 
increased proportions of fragments containing the start codon and the adjacent 90 bp of coding 
sequences. 

B. Construction and amplification of cDNA subpopulations enriched for the 
interior regions of RNA molecules 

To generate relatively short cDNA fragments from the interior regions of an 
RNA molecule, i.e., from a region not containing the 5' or 3' terminus, the following procedure 
is used. 

RNA is purified using any standard procedure (see e.g., Berger, 1987. Methods Enzymol. 
152: 215-219) and cDNA is synthesized according to standard protocols, such as random 
oligomer or oligo-dT primed synthesis (see e.g., Gubler & Hoffman, 1983, Gene 25: 263-269; 
Okayama & Berg, 1982, Mol. Cell. Biol. 2: 161-170). 

The cDNA is initially digested with a pair of restriction endonucleases. Although any 
enzyme pair that generates distinct 5'-terminus overhangs is acceptable, a preferred embodiment 
utilizes enzymes that possess a 4-8 basepair (bp) recognition site yielding a 0-6 bp 5'-terminal 
overhang, and a more preferred embodiment utilizes enzymes that possess a 6 bp recognition 
sequence and generate a 4 bp 5'-terminus overhang. One form of manipulation for generating 
internal fragments is shown in FIG. 3C. The cDNAs are digested with two restriction 
endonucleases, yielding three types of fragments (two with "homo" termini, one with "hetero" termini). Following 
digestion, specific adapters are ligated and the fragments are PCR amplified based upon the 
specific adapter sequence utilized. As indicated by the crossed lines, the nR--nR and nJ--nJ 
fragments are unstable in E. coli and are rarely observed following cloning. 
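
The three fragment types produced by a double digest can be illustrated with a short sketch. The example below is hypothetical and makes several assumptions that are not taken from the specification: the two recognition sites are arbitrary 6 bp examples, each enzyme is labeled after the adapter assumed to be ligated to its overhang ("R" or "J"), and the cut is placed at the first base of each site.

```python
import re


def cut_positions(seq, site):
    """Start index of every occurrence of site (overlaps included)."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]


def classify_fragments(seq, site_r, site_j):
    """List internal fragments as (length, termini, kind).

    Only fragments with both ends created by a cut are reported; the two
    terminal pieces of the molecule are ignored.
    """
    cuts = sorted(
        [(p, "R") for p in cut_positions(seq, site_r)]
        + [(p, "J") for p in cut_positions(seq, site_j)]
    )
    fragments = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        kind = "homo" if left == right else "hetero"
        fragments.append((end - start, f"{left}-{right}", kind))
    return fragments


# Hypothetical cDNA and recognition sites, for illustration only.
cdna = "CCAGATCTAAAAGAATTCTTTGAATTCGGAGATCTCC"
for fragment in classify_fragments(cdna, "AGATCT", "GAATTC"):
    print(fragment)
```

In this model only the "hetero" fragments, which carry a different adapter at each end, would be expected to be recovered efficiently after cloning.
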

Two suitable 24 nucleotide adapter molecules can be generated from RA24 [SEQ ID 
NO:9]; RC24 [SEQ ID NO:10]; JA24 [SEQ ID NO:11]; or JC24 [SEQ ID NO:12]. The adapters 
are generated by annealing the RA24, RC24, JA24 or JC24 24-mer oligonucleotides [SEQ ID 
NOs:9-12, respectively] with 12-mer oligonucleotides possessing sequences that are 
complementary to the last 8 nt of the 3'-terminus of the 24-mer and the 4 bp overhang. The 
sequences of these primers and other primers described herein are provided in Table 1. 
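
The relationship between a 24-mer and its 12-mer partner can be sketched as follows. This is an illustrative reconstruction under an assumed orientation, namely that the 12-mer carries the 4 nt overhang at its 5' end followed by the reverse complement of the last 8 nt of the 24-mer. The function names and the example overhang are hypothetical; the actual 12-mer sequences are those given in Table 1.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a lower-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def partner_12mer(adapter_24mer, adapter_overhang):
    """Build a 12-mer (5' to 3') that anneals to the 3' end of the 24-mer.

    adapter_overhang is the 4 nt single-stranded extension the annealed
    adapter should present (an assumed, hypothetical input).
    """
    adapter = adapter_24mer.replace(" ", "").lower()
    return adapter_overhang.lower() + reverse_complement(adapter[-8:])


# RA24 as listed under SEQ ID NO:9; the "gatc" overhang is hypothetical.
ra24 = "agcacactcc agcctctctc cgaa"
print(partner_12mer(ra24, "gatc"))
```
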

These 4 bp overhang sequences are chosen so as to be complementary to the overhangs 
that are generated by the restriction endonuclease digestions. In addition, the last 3'-terminal 
nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction 
endonuclease recognition site is not re-generated when the adapter anneals to the digested cDNA. 
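
The choice of the last 3'-terminal nucleotide can be checked with a simple string test. The sketch below is illustrative only: the recognition site "agatct" and the fragment end are hypothetical examples rather than enzymes named in this passage, and the function name is an assumption.

```python
def regenerates_site(adapter_24mer, fragment_end, site):
    """True if joining the 24-mer to the fragment recreates the site.

    fragment_end is the strand of the digested cDNA that is joined to the
    3' end of the 24-mer, written 5' to 3' and starting with the overhang.
    Only occurrences of the site that span the junction are considered.
    """
    adapter = adapter_24mer.replace(" ", "").lower()
    site = site.lower()
    junction = adapter + fragment_end.lower()
    # Any site spanning the junction lies entirely inside this window.
    lo = max(0, len(adapter) - len(site) + 1)
    hi = len(adapter) + len(site) - 1
    return site in junction[lo:hi]


# RA24 and RC24 as listed under SEQ ID NOs:9 and 10.
ra24 = "agcacactcc agcctctctc cgaa"
rc24 = "agcacactcc agcctctctc cgac"
# Hypothetical site cut after its first base, leaving "gatct..." on the fragment.
print(regenerates_site(ra24, "gatctttt", "agatct"))
print(regenerates_site(rc24, "gatctttt", "agatct"))
```

For this hypothetical site the adapter ending in A would restore the recognition sequence, so the adapter ending in C would be the one selected.
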

Following ligation of the adapters, the restriction endonucleases are heat-inactivated, and 
the reaction mixture is PCR amplified. 

Internal fragments may alternatively be generated using a second type of adapter, which 
results in longer amplified fragments (also referred to as "Long Internal Chemistry" or "Long 
Chemistry"). This method is similar to short chemistry, except all adapters possess an 
additional common sequence on their 5'-termini. This technique suppresses the amplification of 
small fragments while concomitantly increasing the amplification of longer fragments. The 
subsequent PCR amplification with the "X" and "J" primers results in production of both a 
hetero (i.e., "RX--JR") adapter fragment and "homo" adapter fragments (i.e., "RX--XR" and 
"RJ--JR"), which are unstable in a host and are rarely observed following the cloning process. 

The effectiveness of enriching for internal fragments is shown in FIG. 3D. Several
thousand sequences were generated from internal cDNA fragments and compared against a database of
approximately 5000 known genes with annotated start and stop sites. Each sequence matching
the database was assigned a location on the gene relative to the start (0.0) and stop (1.0) locations,
based on the location of the 5'-most matching nucleotide (of the gene). The distribution from a
standard run shows that most fragments are located "internally" (i.e., within the coding region).
Fragments covering the start codon plus an additional 90 bp (located immediately 3' of the start
codon) are significant, because they have a high probability of containing enough sequence to
identify secreted proteins. A small but significant fraction of the fragments covers the start
codon and the additional 90 bp.
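
The gene-relative location described above can be illustrated with a minimal sketch. The function names, the 0-based coordinate convention, and the half-open fragment interval are assumptions made for illustration only; the specification states only that the start site maps to 0.0 and the stop site to 1.0.

```python
def relative_position(match_start, cds_start, cds_stop):
    """Map a gene coordinate onto the axis start (0.0) .. stop (1.0).

    match_start is the 5'-most matching nucleotide of the fragment;
    cds_start and cds_stop are the annotated start and stop locations.
    All coordinates are assumed to be 0-based positions on the gene.
    """
    if cds_stop <= cds_start:
        raise ValueError("cds_stop must be greater than cds_start")
    return (match_start - cds_start) / (cds_stop - cds_start)


def covers_start_plus_90(frag_start, frag_end, cds_start, extra=90):
    """True if the fragment [frag_start, frag_end) spans the 3-nt start codon
    plus `extra` bp located immediately 3' of it."""
    return frag_start <= cds_start and frag_end >= cds_start + 3 + extra


# A fragment whose 5'-most match lies halfway through the coding region
print(relative_position(600, 100, 1100))
# A fragment that begins upstream of the start codon and extends 100 bp past it
print(covers_start_plus_90(50, 200, 100))
```

Values below 0.0 or above 1.0 would correspond to matches in the 5' or 3' untranslated regions, respectively.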

Following digestion, adapters are ligated to these 5'-terminal overhangs. The primers are
longer relative to primers used to generate short fragments. Two specific pairs of adapter
molecules that can be used in long chemistry synthesis include RXC [SEQ ID NO:2]; RXA
[SEQ ID NO:3]; RJC [SEQ ID NO:4]; or RJA [SEQ ID NO:5]. The adapters are generated by
annealing RXC, RXA, RJC or RJA oligonucleotides [SEQ ID NOs:2-5, respectively] with 12-mer
oligonucleotides possessing sequences that are complementary to the last 8 nt of the
3'-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so
as to be complementary to the overhangs that are generated by the restriction endonuclease
digestions. In addition, the last 3'-terminal nucleotide of the 24-mer adapter (i.e., A or C) is
selected such that a functional restriction endonuclease recognition site is not re-generated when
the adapter anneals to the digested cDNA.

Following the ligation of the adapters, the restriction endonucleases are heat inactivated
and the reaction mixture is PCR amplified. While the sequences of the two adapters are distinct,
they nevertheless possess common 5' sequences that allow the formation of lariat or pan-handle
structures that function to suppress PCR-mediated amplification of the shorter fragments.

C. cDNA Synthesis of molecules enriched for 3' ends

3'-enriched cDNA synthesis generates cDNAs that are enriched for the sequences
oriented towards the 3'-terminus of the cDNA. This is accomplished by synthesis of the first
strand using a specific oligonucleotide sequence that has been modified to contain an adapter
sequence at its 5'-terminus [SEQ ID NO:14]. Following first-strand cDNA synthesis with the
primer, standard cDNA synthesis protocols are utilized as illustrated in FIG. 2.

The 3'-enriched cDNA is digested with one restriction endonuclease. Although any
enzyme that generates a distinct 5'-terminus overhang is acceptable, it is generally most preferred
to utilize an enzyme that possesses a 6 bp recognition site yielding a 4 bp 5'-terminal overhang.
Following digestion, an adapter is then ligated to these 5'-terminal overhangs. These adapters are
generated from the JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with
12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the
3'-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so
as to be complementary to the overhangs that are generated by the restriction endonuclease
digestions. In addition, the last 3'-terminal nucleotide of the 24-mer adapter (i.e., A or C) is
selected such that a functional restriction endonuclease recognition site is not re-generated when
the adapter anneals to the digested cDNA.

Following the ligation of the adapters, the restriction endonucleases are heat inactivated
and the reaction mixture is PCR amplified.

Longer fragments enriched for the 3'-ends can be obtained by ligating a longer primer to
cDNA molecules that have been digested with a restriction enzyme. Any enzyme that generates
a distinct 5'-terminus overhang can be used. It is generally preferred to utilize an enzyme that
possesses a 6 bp recognition site yielding a 4 bp 5'-terminal overhang. Following digestion, an
adapter is then ligated to the 5'-terminal overhangs. Acceptable adapters are generated from the
JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with 12-mer oligonucleotides
possessing sequences that are complementary to the last 8 nt of the 3'-terminus of the 24-mer and
the 4 bp overhang. These 4 bp overhang sequences are chosen so as to be complementary to the
overhangs that are generated by the restriction endonuclease digestion. In addition, the last
3'-terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional
restriction endonuclease recognition site is not regenerated when the adapter anneals to the
digested cDNA.

While the sequences of the two adapters are distinct, they possess common 5' sequences
that allow the formation of structures that suppress PCR-mediated amplification of the shorter
fragments.

Following the ligation of the adapters, the restriction endonucleases are heat inactivated
and the reaction mixture is PCR amplified.

The cDNA fragments prepared as above can be size-fractionated, e.g., by electrophoretic
fractionation on agarose or polyacrylamide gels, or other types of gels comprised of a similar
material. The cDNA fragments may then be physically excised in defined size ranges (i.e., as
identified by size markers) and recovered from the excised gel fragments. Additionally, if the
quantities of isolated cDNA fragments are low, they can be amplified, e.g., by PCR amplification.
For example, if the cDNA fragments are generated by Long Internal SeqCalling™ Chemistry
protocol, they are amplified with J23 [SEQ ID NO:6] and X22 [SEQ ID NO:15] primers (either
before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into
E. coli. Similarly, if the cDNA fragments are generated by Long 5' SeqCalling™ Chemistry
protocol, they can be amplified by J23 [SEQ ID NO:6] and RS [SEQ ID NO:1] oligonucleotides
(either before or after fractionation) prior to cloning, as these products cannot be efficiently
cloned into E. coli.

When PCR amplification is used to amplify fragments, conditions are preferably
chosen to minimize non-productive hybridization events. It has been observed that DNA
re-hybridization during the PCR amplification process (designated the "Cot effect"; see e.g.,
Mathieu-Daude, et al., 1996. Nucl. Acids Res. 24: 2080-2084) can inhibit amplification. This
effect is particularly evident during later PCR amplification cycles, when a substantial
concentration of the amplified product has accumulated and the primer concentration has been
depleted. As a result, amplification in the later PCR cycles typically follows non-linear dynamics.

By manipulating PCR amplification reaction conditions, it is possible to markedly
enhance the "Cot effect" by the insertion of a slow-annealing step in between the denaturation
and re-naturation steps in each PCR amplification cycle. The slow-annealing temperature is
chosen so as to be above that of the primer-template melting temperature (Tm), but at or below
that of the template-template Tm, thus favoring template-template annealing over template-primer
annealing. For example, a decrease in temperature from 85°C to 75°C at a gradient of
10°C/minute can be utilized.


Partitioning methods 

One or more of the following techniques, or combinations of these techniques, can be used
to normalize the abundance of RNA (or their cDNA counterpart) species within a given cell or
tissue sample.

(i) Partitioning by restriction endonuclease digestion

A cDNA library can be partitioned into many different sets of fragments by digestion
with different restriction enzyme pairs. Fragmentation of the same cDNA library with different
sets of restriction enzymes, in different reaction vessels, results in segregated multiple partitions,
i.e., each specific fragment will occur in only one partition. The digested fragments can be
analyzed further, e.g., by direct sequencing, cloning of the digested fragments or sequencing, or
one or more of these techniques.

If desired, the cDNA is digested into fragments of a length that is convenient for
sequencing. Preferably, multiple different partitions, e.g., 10-100, 20-750, or 50-250 partitions,
are obtained.
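
Partitioning by restriction enzyme pairs can be pictured with a minimal in silico sketch. It is an illustration under stated assumptions, not the protocol itself: the three enzymes (EcoRI, BamHI, HindIII) and the test sequence are chosen only as examples, each enzyme is assumed to cut after the first nucleotide of its 6 bp site, and only fragments bounded by two different enzymes ("hetero" termini) are retained.

```python
# Recognition site and cut offset (G^AATTC, G^GATCC, A^AGCTT) for three
# common 6 bp cutters that leave 4 nt 5' overhangs. Example enzymes only.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def cut_sites(seq, enzyme):
    """Return (cut_position, enzyme) for every occurrence of the enzyme's site."""
    site, offset = ENZYMES[enzyme]
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append((i + offset, enzyme))
        i = seq.find(site, i + 1)
    return cuts


def hetero_fragments(seq, enzyme_a, enzyme_b):
    """Digest seq with both enzymes; keep fragments whose two ends were
    produced by different enzymes."""
    cuts = sorted(cut_sites(seq, enzyme_a) + cut_sites(seq, enzyme_b))
    fragments = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        if e1 != e2:
            fragments.append(seq[p1:p2])
    return fragments


# Hypothetical cDNA carrying one site for each enzyme
cdna = "TTTTGAATTCACGTACGTGGATCCCCCCAAGCTTTT"
for pair in [("EcoRI", "BamHI"), ("BamHI", "HindIII"), ("EcoRI", "HindIII")]:
    print(pair, hetero_fragments(cdna, *pair))
```

In this example each enzyme pair yields a different fragment from the same cDNA, which is the sense in which separate digests define separate partitions.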

(ii) Partitioning by fragment size or other physical property

Partitioning can also be performed using other separation methods that separate DNA
molecules according to their physical characteristics. The methods can include, e.g., separation
based on physical and/or biochemical properties (i.e., molecular weight/size, terminal nucleotide
sequences, exact migratory pattern, and the like). Separation methods can include, e.g., gel
electrophoresis, including agarose or polyacrylamide gel electrophoresis, high pressure liquid
chromatography (HPLC), preparative-scale capillary electrophoresis, and similar methodologies.

In one embodiment, unique cDNAs that represent unique (i.e., not previously sequenced)
fragments are selected based on their presence in a characteristic restriction enzyme fragment. In
this process, a cDNA population is digested with restriction endonucleases, fractionated, and
fragments in a desired size range are recovered. The recovered fragments are then ligated to a
vector and transformed into an appropriate host, e.g., E. coli. Rather than being directly
sequenced following the selection process, the DNA fragments are isolated and separated, e.g.,
sized using one or more sizing matrixes that separate the molecules as a function of their physical
or biochemical properties. The embodiment is thus referred to as "clone sizing". Those
recombinant clones that have an insert with characteristics not present in a reference database are
determined to contain a unique DNA fragment. Preferably, only unique fragments are
subsequently sequenced.

For example, a DNA fragment that is sized in this way possesses two pieces of
information that serve as a unique identifier: (i) the identity of the restriction endonuclease used
to generate the fragment, and (ii) the size of the fragment. With these two pieces of information,
fragments are picked for subsequent nucleotide sequencing by searching for a specific fragment
within a 0.2 basepair window. If a fragment is present in the window, the E. coli clone
containing the fragment is re-arrayed on a liquid handling robot such as a Tecan Genesis or
Packard Multiprobe device, and sequenced. When multiple fragments are present within the 0.2
bp window, only one is selected to be sequenced. Thus, by use of this sizing filter, sequencing of
identical fragments is significantly lowered.
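
The sizing filter can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the data layout, and the reading of the "0.2 basepair window" as an absolute size difference of at most 0.2 bp are hypothetical choices, and the clone identifiers and sizes are invented for the example.

```python
def select_for_sequencing(clones, known_sizes, window=0.2):
    """Pick at most one clone per (enzyme, size window) not already known.

    clones: iterable of (clone_id, enzyme, size_bp) tuples.
    known_sizes: dict mapping enzyme -> list of previously sequenced sizes.
    window: maximum size difference (bp) at which two fragments are
            treated as the same fragment (assumed interpretation).
    """
    seen = {enzyme: list(sizes) for enzyme, sizes in known_sizes.items()}
    selected = []
    for clone_id, enzyme, size in clones:
        prior = seen.setdefault(enzyme, [])
        if any(abs(size - s) <= window for s in prior):
            continue  # already sequenced, or another clone in this window was picked
        selected.append(clone_id)
        prior.append(size)
    return selected


# Hypothetical sizing results
clones = [
    ("c1", "EcoRI", 150.10),
    ("c2", "EcoRI", 150.15),  # same window as c1, skipped
    ("c3", "EcoRI", 212.40),
    ("c4", "BamHI", 150.10),  # same size but different enzyme, kept
    ("c5", "EcoRI", 300.00),  # matches a previously sequenced fragment
]
print(select_for_sequencing(clones, {"EcoRI": [300.05]}))
```

Keying on both the enzyme and the size reflects the two-part identifier described above; a size match under a different enzyme does not count as a duplicate.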

By sizing individual fragments and comparing the observed size to previously determined
sequences, i.e., using a "sizing filter", only fragments of unique lengths need to be sequenced.

To pre-size large numbers of fragments, the fragments can be initially pooled as a
function of their expected size, so as to ensure that any fragment occurs in a minimum of
three individual pools.

Size fractionation may be accomplished in a number of ways. One commonly utilized
method is electrophoretic fractionation on agarose or polyacrylamide gels, or other types of gels
comprised of a similar material. The cDNA fragments may then be physically excised in defined
size ranges (i.e., as identified by size markers) and recovered from the excised gel fragments.
Additionally, if the quantities of isolated cDNA fragments are low, they can be PCR amplified at
this stage. For example, if the cDNA fragments are generated by Long Internal SeqCalling™
Chemistry protocol, described above, they must be amplified with J23 and X22 primers (either
before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into
E. coli. Similarly, if the cDNA fragments are generated by Long 5' SeqCalling™ Chemistry
protocol, described above, they must be amplified by J23 and RS oligonucleotides (either before
or after fractionation) prior to cloning, as these products cannot be efficiently cloned into E. coli.


(iii) Partitioning based on hybridization

Screening can be performed using a variety of methods that rely on hybridization
between a probe sequence or sequences and a cDNA library. Members of the library containing
a homologous sequence are then removed from the library. For example, a cDNA library can be
brought into contact with a prepared library of known sequence in such a way that any sequence
contained within the substrate library that is complementary to any element of the subtraction
library is removed or suppressed. This method obviates re-characterizing, e.g., re-sequencing,
already characterized members of the cDNA population.


(iv) Amplification-associated partitioning

Partitioning can also be performed in association with amplification. In particular,
partitioning can be carried out during PCR amplification of adapter-ligated cDNA fragments
described above. During PCR-mediated amplification of mixtures of cDNA fragments, short
fragments tend to be preferentially amplified relative to large fragments. PCR conditions can be
adjusted to favor the formation of larger fragments within the PCR reaction to allow efficient
preferential amplification of longer fragments.

Normally, two different primers are used in PCR amplification to prime the enzymatic
activity of the polymerase at each terminus of the target sequence. Conversely, if primers with
identical 5' sequences are used, there is a tendency for the fragments to form lariat or pan-handle
structures, due to intra-strand hybridization, which interferes with the amplification process.
Because the probability of the two ends of a polymer (i.e., cDNA fragment) finding one another
is inversely proportional to a fractional power of the polymer length, short fragments tend to
form these lariat structures more readily than do longer ones. Accordingly, this effect is
exploited in the amplification of long cDNA fragments. See U.S. Patent No. 5,565,340, whose
disclosure is incorporated herein by reference, in its entirety.
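
The length dependence stated above can be made concrete with a short numerical sketch. The exponent of 1.5 is an assumption: it is the textbook value for an ideal (Gaussian) polymer chain, whereas the specification says only "a fractional power". The function name and the fragment lengths are likewise illustrative.

```python
def relative_loop_probability(length_bp, reference_bp, exponent=1.5):
    """Probability that the two ends of a fragment of length_bp meet,
    relative to a fragment of reference_bp, assuming P(L) ~ L**(-exponent)."""
    if length_bp <= 0 or reference_bp <= 0:
        raise ValueError("lengths must be positive")
    return (reference_bp / length_bp) ** exponent


# Under this assumption, a 400 bp fragment forms the suppressive end-to-end
# structure less often than a 100 bp fragment by this factor:
print(relative_loop_probability(400, 100))
```

Under the assumed exponent, lengthening a fragment fourfold lowers its relative tendency to self-anneal roughly eightfold, which is the direction of the effect exploited for long-fragment amplification.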

Long fragment amplification can be enhanced using DNA fragments to which have been
ligated long adapter sequences as described above. Amplification is dependent upon a number of
factors that can alter the ratio of a linear adapter structure, which is permissive for amplification,
and a lariat-loop structure, which suppresses amplification. The equilibrium constant associated
with the formation of the suppressive and the permissive structures, and, therefore, the efficiency
of suppression of particular DNA fragments during PCR, is primarily a function of the following
factors: (i) differences in melting temperature of suppressive and permissive structures; (ii)
position of the primer sequence within the adapter; (iii) the length of the target DNA fragments;
(iv) PCR primer concentration; and (v) primary structure.

Analysis of partitioned cDNA molecules

Partitioned cDNA molecules are next analyzed by comparing the sequences to a reference
nucleic acid or nucleic acids. To facilitate analysis of partitioned cDNA molecules, they can, if
not subcloned previously, be ligated into an appropriate vector and transformed into cells by any
applicable method.

The reference nucleic acid or nucleic acids can be any fragment for which sufficient
information is available to unambiguously identify the partitioned cDNA molecule. The
reference nucleic acid or nucleic acids can therefore be part of, e.g., sequence databases, or
databases of other characteristics that unambiguously identify a nucleic acid. Examples of such
characteristics include, e.g., a compilation of fragment sizes associated with specific restriction
enzymes for a particular gene. In some embodiments, partitioned nucleic acids will be
sequenced. The partitioned sequences can be sequenced by any method known in the art and the
resulting sequence data is analyzed by computer-based systems.

Suitable databases include publicly available databases that comprehensively record all
observed DNA sequences. Such databases include, e.g., GenBank from the National Center for
Biotechnology Information (Bethesda, Md.), the EMBL Data Library at the European
Bioinformatics Institute (Hinxton Hall, UK) and databases from the National Center for Genome
Research (Santa Fe, N.Mex.). However, any database containing entries for the sequences likely
to be present in such a sample to be analyzed is usable in the further steps of the computer
methods. Methods of searching databases are described in detail in, e.g., U.S. Patent No.
5,871,697, whose disclosure is incorporated herein by reference, in its entirety.

Table 1 below summarizes the various primers and adapters disclosed herein.


Table 1

SEQ ID NO:   Name    Sequence (from 5' to 3')

1            RS      CTCTCCGATG CAGGTGGC
2            RXC     AGCACACTCC AGCCTCTCTC CGAGCACATG CGACACTGAG TACTAC
3            RXA     AGCACACTCC AGCCTCTCTC CGAGCACATG CGACACTGAG TACTAA
4            RJC     AGCACACTCC AGCCTCTCTC CGAACCGACG TCGAATATCC ATGCAGC
5            RJA     AGCACACTCC AGCCTCTCTC CGAACCGACG TCGAATATCC ATGCAGA
6            J23     ACCGACGTCG AATATCCATG CAG
7            R23     AGCACACTCC AGCCTCTCTC CGA
8            NR17    AGCACACTCC AGCCTCT
9            RA24    AGCACACTCC AGCCTCTCTC CGAA
10           RC24    AGCACACTCC AGCCTCTCTC CGAC
11           JA24    ACCGACGTCG AATATCCATG CAGA
12           JC24    ACCGACGTCG AATATCCATG CAGC
13           Dt-R    AGCACACTCC AGCCTCTCTC CGA
14                   AGCACACTCC AGCCTCTCTC CGATTTTTTT TTTTTTTTTT TTT
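
The relationships among the Table 1 oligonucleotides, and the design of the 12-mer partner, can be checked with a minimal sketch. The sequences are taken from Table 1; everything else is an assumption made for illustration: the helper names, the orientation of the 12-mer (4 nt overhang-pairing sequence at its 5' end followed by the reverse complement of the last 8 nt of the 24-mer), and the use of EcoRI (site GAATTC, overhang AATT) as the example enzyme.

```python
R23 = "AGCACACTCCAGCCTCTCTCCGA"
J23 = "ACCGACGTCGAATATCCATGCAG"
RA24, RC24 = R23 + "A", R23 + "C"   # 24-mers: 23-mer plus a 3'-terminal A or C
JA24, JC24 = J23 + "A", J23 + "C"


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def twelve_mer(adapter, overhang):
    """Assumed 12-mer layout: 4 nt that pair with the cDNA overhang, then
    8 nt complementary to the 3'-terminal 8 nt of the longer oligonucleotide."""
    return overhang + reverse_complement(adapter[-8:])


def regenerates_site(adapter, cut_cdna_end, site):
    """True if joining the adapter to the cut cDNA end recreates the site."""
    return site in adapter[-len(site):] + cut_cdna_end[:len(site)]


# Example with an EcoRI-cut cDNA end (G^AATTC leaves AATTC... on the cDNA)
print(twelve_mer(JA24, "AATT"))
print(regenerates_site(JA24, "AATTC", "GAATTC"))
```

Here the 3'-terminal A of JA24 yields the junction AAATTC rather than GAATTC, which illustrates why the last nucleotide is chosen so that a functional recognition site is not re-generated.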


EXAMPLES

The invention will be further described in the following examples, which do not limit the
scope of the invention described in the claims. Examples 1-6 collectively describe the synthesis
and amplification of cDNA subfractions enriched for the 5' terminal sequences of mRNA
molecules. Example 7 describes clone sizing.

Example 1. 5' cDNA Synthesis: phosphatase/pyrophosphatase digestion

For each reaction, 2.5 µg mRNA (do not exceed 3 µg total) is added to H2O so as to
provide a total volume of 73.5 µl. This mixture is then heated to 65°C for 10 minutes, and
quick-cooled on ice. The CIAP Cocktail (see below) is made as follows:

CIAP Cocktail:

For each reaction:                               x 11
10 µl     10x CIAP buffer                        110 µl
2.5 µl    RNasin (Promega)                       27.5 µl
10 µl     0.1 M DTT                              110 µl
4 µl      0.01 U/µl CIAP*                        35 µl

1) 26.5 µl of the above enzyme mixture is added to each 3 µl mRNA to give
a total volume of 30.5 µl. 73.5 µl of the RNA mix is then added to give a
final volume of 100 µl.



2) Incubate at 37°C for 40 minutes.

3) Add 100 µl TE buffer (10 mM Tris pH 8.0; 0.1 mM EDTA).

4) Add 200 µl Acid-Phenol.

5) Mix vigorously.

6) Add 200 µl Chloroform-Isoamyl Alcohol (24:1 v/v).

7) Mix vigorously.

8) Centrifuge in a microfuge at maximum speed for 10 minutes.

9) Remove supernatant and transfer to new tube. Discard bottom layer.

10) Repeat steps 4-9 (only for CIAP treatment, not in later steps).

11) Add 2 µl ssDNA carrier and 20 µl 3 M Sodium Acetate to each tube.

12) Vortex 10 seconds and add 440 µl of absolute ethanol.

13) Vortex 10 seconds and incubate at least 30 minutes at -80°C.

14) Centrifuge samples at 13,200 x g for 15 minutes.

15) Wash nucleic acid pellets with 70% ethanol and air-dry pellet.

16) Dissolve nucleic acid pellet in 70 µl water and cool on ice.

17) Centrifuge for 10-15 seconds at maximum speed.

18) Transfer contents of tubes to 8-strip tubes.

19) Add 30 µl TAP cocktail (see below).


TAP Cocktail:

For each reaction:                               x 11
10 µl     10x TAP buffer                         110 µl
2.5 µl    RNasin                                 27.5 µl
15.5 µl   H2O                                    170.5 µl
2.0 µl    10 U/µl TAP (Epicenter)                22 µl

20) Add 30 µl of above mixture to each 70 µl CIAP-treated sample for
a total volume of 100 µl.

21) Incubate at 37°C for 45 minutes.

22) Repeat Phenol/Chloroform extraction and precipitation as above in
steps 6-9 and then 11-15 (do not resuspend pellet).
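
The cocktails in these examples are tabulated per reaction and as an 11-reaction master mix. The arithmetic can be sketched as below, using the TAP Cocktail volumes given above; the function name and dictionary layout are illustrative assumptions.

```python
def scale_cocktail(per_reaction_ul, n_reactions=11):
    """Scale per-reaction volumes (in µl) up to a master mix for n_reactions."""
    return {name: vol * n_reactions for name, vol in per_reaction_ul.items()}


# Per-reaction TAP Cocktail volumes (µl), as listed above
tap_cocktail = {
    "10x TAP buffer": 10.0,
    "RNasin": 2.5,
    "H2O": 15.5,
    "10 U/µl TAP": 2.0,
}

master_mix = scale_cocktail(tap_cocktail)
for name, vol in master_mix.items():
    print(f"{name}: {vol} µl")
print("per-reaction total:", sum(tap_cocktail.values()), "µl")
```

The per-reaction total of 30 µl matches the volume added to each 70 µl CIAP-treated sample in step 20.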

Example 2. 5' cDNA Synthesis: DNA-RNA Hybrid Primer Ligation

1) Transfer samples from Example 1 to 8-strip tubes.

2) Resuspend pellet in Ligation Cocktail (see below).

Ligation Cocktail:

For each reaction:                               x 11
3 µl      10 mM ATP                              33 µl
1 µl      RNasin                                 11 µl
4.5 µl    H2O                                    49.5 µl
2 µl      R-BAP-TAP DNA/RNA hybrid oligomer      22 µl

3) Add 10.5 µl of above mixture to each pellet, dissolve pellet completely at
room temperature by (preferably) tapping the tube or vortexing if needed.

4) Make an enzyme mix as follows:

Enzyme Mixture:

For each reaction:                               x 11
30 µl     H2O                                    330 µl
12 µl     5x DNA Ligase Buffer (Life Tech.)      132 µl
1.5 µl    RNasin                                 16.5 µl
6 µl      T4 RNA Ligase (Life Tech.)             66 µl

Total reaction volume 60 µl

5) Incubate overnight at 20°C.

6) Repeat Phenol/Chloroform and precipitation as above in CIP/TAP Cocktail
protocol steps 6-9 and 11-15 (do not resuspend pellet).

Example 3. 5' cDNA Synthesis: cDNA First-Strand Synthesis

1) Resuspend cDNA pellet in Random Hexamer Cocktail (see below).

Random Hexamer Cocktail:

For each reaction:                                          x 11
10 µl     H2O                                               110 µl
0.5 µl    random hexamer (dN6-5'-phosphate, 100 µM)         5.5 µl
5 µl      Oligo-(dT) (dT30VN-5'-phosphate, 100 µM)          55 µl

2) Add 15.5 µl of above mixture to each tube and resuspend pellet.

3) Heat at 70°C for 10 minutes and quick-cool on ice.

4) Make First-Strand Synthesis Cocktail as follows (see below).

First-Strand Synthesis Cocktail:

For each reaction:                               x 11
6 µl      5x First-Strand Buffer                 66 µl
3 µl      10 mM dNTPs                            33 µl
3 µl      100 mM DTT                             33 µl
1 µl      RNase Inhibitor                        11 µl

5) Add 13 µl of the above mixture to each 15.5 µl sample to give a total volume
of 28.5 µl.

6) Incubate at 37°C for 2 minutes.

7) Add 1.5 µl Superscript II RT to each reaction for a total volume of 30 µl.

8) Incubate at 37°C for 10 minutes.

9) Incubate at 42°C for 1 hour.

10) Incubate at 16°C.

11) Add 40 µl of the following DNA Ligase Mixture (see below) to each reaction
tube for a total volume of 70 µl.

E. coli DNA Ligase Mixture:

For each reaction:                               x 11
4 µl      10x E. coli Ligase Buffer              44 µl
33 µl     H2O                                    330 µl
3 µl      E. coli DNA Ligase (10 U/µl)           33 µl

12) Continue incubation at 16°C for 2 hours.

Example 4. 5' cDNA Synthesis: removal of non-ligated Primers

While the above 2 hour incubation described in Example 3 is progressing, prepare
one Boehringer-Mannheim Quick-Spin G-50 column per reaction as follows:

1) Mix the resin bed well by inverting the columns repeatedly.

2) Remove the top cap first, and then the bottom cap. This avoids bubble formation
and resultant poor performance of the spin-column.

3) Stand column vertically and allow to drain completely.

4) Add 0.75 ml of 10 mM Tris (pH 7.5) to the top of the bed without disturbing it.
If the bed becomes disturbed, pipette the solution up and down slowly to mix
the bed uniformly and allow the bed to re-settle so as to form a uniform surface.

5) Stand column vertically and allow to drain completely.

6) Place the columns into a 15 ml conical centrifuge tube with the vendor's
associated collector tube beneath the spin-column to collect the sample.

7) Centrifuge spin-column at 1000-1200 x g for 2 minutes.

8) Remove spin-column with forceps, then remove and discard the tube containing
the flow-through.

9) Carefully load the sample to the top center of the spin-column.

10) Wash the sample tube with 20 µl H2O and load on the same column.

11) Place a new collection tube beneath each spin-column and centrifuge at
1000-1200 x g for 4 minutes.

12) Remove spin-columns and collect the flow-through into new, labeled tubes.

13) Total sample volume will be approximately 105 µl.

Example 5. 5' cDNA Synthesis: RNase (H, A, and T1) Treatment

1) To each reaction described in Example 4 add Second-Strand Reaction Buffer (see
below).

Second-Strand Reaction Buffer:

For each reaction:                               x 11
3 µl      100 mM DTT                             33 µl
6 µl      First-Strand Buffer                    33 µl
30 µl     Second-Strand Buffer                   330 µl
6 µl      H2O                                    66 µl

2) Add 45 µl of the above mixture to each 105 µl sample to give a total volume of
150 µl.

3) Add 2 µl of RNase H to each sample.

4) Incubate at 37°C for 30 minutes to nick the RNA in RNA/DNA hybrids.

5) Make an RNase Mixture comprising: 22 µl RNase H, 44 µl RNase Cocktail
(Ambion; available as an RNase A and RNase T1 mixture).

6) Heat samples to 95°C for 2 minutes.

7) Slow cool down to 37°C and continue incubation.

8) Add 3 µl RNase Mixture to each of the cDNAs, mix by pipetting up and down.

9) Continue incubation at 37°C for an additional 10 minutes.

10) Heat samples to 95°C for 2 minutes.

11) Slow cool down to 37°C and continue incubation.

12) Add an additional 3 µl of RNase Mixture to each of the cDNAs, mix by
pipetting up and down.

13) Continue incubation at 37°C for an additional 15 minutes.

14) Repeat Phenol/Chloroform extraction and precipitation as above in
steps 6-9 and then 11-15.

15) Dissolve pellet in 20 µl H2O.

16) Remove a 5 µl aliquot for Second-Strand synthesis (see below) for producing
5'-cDNA for SeqCalling™ Chemistry Protocol.

Example 6. Second-Strand Synthesis for Producing 5'-cDNA for SeqCalling™ Chemistry

1) Generate PCR Mixture (see below) as follows:

PCR Mixture:

For each reaction:                               x 11
5 µl      10x PCR Buffer                         55 µl
1 µl      10 mM dNTPs                            5.5 µl
1 µl      10 µM R17 Primer                       5.5 µl
37.5 µl   H2O                                    412.5 µl
0.5 µl    Advantage Polymerase                   5.5 µl

2) Add 45 µl of the above mixture to each 5 µl sample, for a total volume of 50 µl.

3) Heat samples as per protocol below, making sure that the sample tubes are placed
in the thermocycler only after it has reached >80°C.

94°C for 2 minutes
55°C for 2 minutes       x 1 cycle ONLY
72°C for 60 minutes      (Cycle designated KM-AD-2N)
4°C for long-term storage

4) Warm reaction tubes to 37°C.

5) Make SAP Cocktail (see below) as follows:

SAP Cocktail:

For each reaction:                                      x 11
12 µl     10x SAP Buffer                                132 µl
5 µl      H2O                                           55 µl
3 µl      Shrimp Alkaline Phosphatase (SAP; 1 U/µl)     33 µl

6) Add 20 µl of SAP Cocktail to each reaction.

7) Heat to 37°C for 30 minutes.

8) Purify samples by Qiagen 96-well plate as per manufacturer's protocol.

9) Elute cDNAs in 100 µl 10 mM Tris-HCl buffer and proceed with fluorometry.

Example 7. Clone Sizing

SeqCalling™ Chemistry products generated in any of Examples 1-6 are diluted and
re-amplified. Fractionation is then performed by electrophoresing the re-amplified sample on an
agarose gel using MetaPhor agarose (FMC). After the electrophoresis, the gel is physically cut
into a total of 48 fractions. 24 of the fractions are derived from a 4% MetaPhor gel, and
correspond to the lower molecular weight fractions, whereas the other 24 fractions, derived from
the 3% MetaPhor gel, correspond to the upper molecular weight fractions.

Following the elution of the DNA from the gel fractions, the DNA fragments are ligated
into the TOPO-TA cloning vector (Invitrogen). These plasmids are then
transformed into E. coli. The transformed bacterial cells are plated onto petri dishes and grown
to a size that allows automated colony picking. A suitable number of colonies/fraction are
selected so as to ensure a statistically accurate representation of the DNA fragments contained
within the fraction (suitable numbers of picked colonies/fraction are 48 or 96). Following
the incubation of the selected clones, the fragments contained within each individual clone are
sized using the proprietary MegaBACE system, or an equivalent. Sizing is performed with
multiple clones/lane. This multiplexing allows sizing to be performed in a cost- and time-efficient
manner. The multiplexing is performed with a liquid handling robot (e.g., Matrix PlateMate).
After running the multiplexed fragments on the MegaBACE and correlating the size of the fragment
with the E. coli clone containing the insert, the fragments are analyzed to determine suitability
for sequencing.

Example 8. Comparison of clone complexity with and without use of a sizing step

The effect of using a clone sizing step on the complexity, i.e., the representation of rare
transcripts, of the resulting clones, is shown in FIGS. 4A and 4B. In FIG. 4A, no sizing step was
used, while clone sizing was used in the identification of the clones shown in FIG. 4B. Shown in
the figures is a comparison of the frequencies (expressed in percentage) of clones derived from
transcripts present at varying levels. The outer numbers represent the prevalence of a particular
clone sequenced, and the inner numbers represent the percentages of the total number of clones
sequenced that fall into this abundance class. As illustrated in FIG. 4A, the sequencing results
that were obtained without the use of the sizing filter demonstrated that only a small percentage
of the total number of fragments that were sequenced were low copy number fragments
(i.e., singletons, duplicates, and triplicates). Specifically, singletons were found to comprise only
2% of the total number of fragments sequenced, while fragments that were present at greater than
51 copies comprised 38% of the total fragments sequenced. In contrast, as illustrated in FIG. 4B,
the sequencing results that were obtained with the use of the sizing filter were enriched for clones
from low abundance transcripts (i.e., singletons, duplicates, and triplicates). These clones
constituted approximately 33% of the total fragments sequenced. Without the use of
this sizing filter, these fragments were found to comprise a total of only 8% of the sequencing
results.
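
The abundance-class comparison can be expressed as a short calculation. This is a minimal sketch, assuming that each sequenced clone has been assigned an identifier for the sequence it contains; the function name and the toy data are hypothetical and are not the data underlying FIGS. 4A and 4B.

```python
from collections import Counter


def low_copy_fraction(clone_sequence_ids, max_copies=3):
    """Fraction of sequenced clones whose sequence was observed at most
    max_copies times (singletons, duplicates, and triplicates by default)."""
    if not clone_sequence_ids:
        raise ValueError("no clones supplied")
    counts = Counter(clone_sequence_ids)
    low = sum(c for c in counts.values() if c <= max_copies)
    return low / len(clone_sequence_ids)


# Toy run: one sequence seen 6 times, plus two singletons and one duplicate
run = ["a"] * 6 + ["b", "c", "c", "d"]
print(low_copy_fraction(run))
```

Applying the same calculation to runs performed with and without the sizing filter gives the kind of percentage comparison reported above.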


Equivalents 

Although particular embodiments have been disclosed herein in detail, this has been done 
by way of example for purposes of illustration only, and is not intended to be limiting with 
respect to the scope of the appended claims that follow. In particular, it is contemplated by the 
inventor that various substitutions, alterations, and modifications may be made to the invention 
without departing from the spirit and scope of the invention as defined by the claims. For 
example, the selection of the specific tissue(s) or cell line(s) that is to be utilized in the practice 
of the present invention is believed to be a matter of routine for a person of ordinary skill in the 
art with knowledge of the embodiments described herein. 

1. A method of screening a population of nucleic acids for a novel sequence, the method
comprising: 

providing a population of nucleic acid sequences; 

partitioning said population into one or more subpopulations of nucleic acids;

identifying a first nucleic acid sequence in the subpopulation of nucleic acid sequences; and

comparing the first nucleic acid sequence to a reference nucleic acid sequence or 
sequences, wherein the absence of the first nucleic acid sequence in the reference nucleic acid or 
nucleic acid sequences indicates the first nucleic acid is a novel nucleic acid sequence. 

2. The method of claim 1, wherein said DNA population is a cDNA population derived
from a population of RNA molecules. 

3. The method of claim 2, further comprising partitioning the RNA molecules. 

4. The method of claim 2, wherein said cDNA population is derived from the 5' ends of the
RNA molecules. 

5. The method of claim 2, wherein said cDNA population is derived from the interior 
regions of the RNA molecules. 

6. The method of claim 2, wherein said cDNA population is derived from the 3' ends of the
RNA molecules.

7. The method of claim 2, wherein said partitioning step comprises hybridization of a probe 
nucleic acid sequence to the population of nucleic acids. 

8. The method of claim 2, wherein said partitioning step comprises digesting the cDNA 
molecules with one or more restriction enzymes. 

9. The method of claim 8, further comprising ligating adapter oligonucleotides to the
termini of the digested cDNA molecules. 

10. The method of claim 9, further comprising amplifying the ligation products. 

11. The method of claim 8, further comprising separating the amplified products.

12. The method of claim 11, wherein said separating is by gel electrophoresis.

13. The method of claim 11, wherein the first nucleic acid sequence is identified by
comparing the size of one or more digestion products produced by a member of the 
subpopulation of nucleic acids to the sizes of fragments generated by the same restriction enzyme 
or enzymes in said reference nucleic acid or nucleic acids. 

14. The method of claim 11, further comprising:

recovering one or more size-separated digestion products;

reamplifying the recovered products; and

separating the reamplified products.

15. The method of claim 14, wherein said separating is by gel electrophoresis.

16. The method of claim 15, wherein the first nucleic acid sequence is identified by 
comparing the size of one or more digestion products produced by a member of the 
subpopulation of nucleic acids to the sizes of fragments generated by the same restriction enzyme 
or enzymes in said reference nucleic acid or nucleic acids. 

17. The method of claim 9, further comprising: 

inserting the ligated adapter oligonucleotide into a cloning vector to form a vector-insert; 
transforming the vector-insert into a suitable host; 

culturing transformed host under conditions allowing for replication of the vector-insert; 

recovering the vector-insert from said host;

digesting the vector-insert with one or more restriction enzymes, thereby releasing said 
insert; and 

comparing the size of the insert to sizes of fragments generated by the same restriction 
enzyme or enzymes in said reference nucleic acid or nucleic acids. 

18. The method of claim 1, wherein comparing is by determining at least a portion of the
nucleotide sequence of the first nucleic acid sequence and comparing the nucleotide sequence to 
the nucleotide sequence of one or more reference nucleic acids. 

19. The method of claim 1, wherein comparing is by hybridizing the first nucleic acid
sequence to one or more of the reference nucleic acid sequences. 

20. A method for equalizing the representation of nucleic acids in a population of nucleic
acids, the method comprising:

providing a population of nucleic acid sequences, wherein said population comprises a 
first nucleic acid and a second nucleic acid having a nucleic acid sequence distinct from the first 
nucleic acid, and wherein said first nucleic acid is present at a higher level in said population 
than said second nucleic acid;

partitioning said population into one or more subpopulations of nucleic acids; and 

comparing the levels of said first nucleic acid sequence to the levels of said second 
nucleic acid sequence in the subpopulation of nucleic acid sequences, wherein a lower level of 
the first nucleic acid sequence relative to the second nucleic acid sequence indicates that the
representation of said first and second nucleic acid sequences is normalized.

21. A method for producing a population of nucleic acid molecules enriched for 5' regions of
mRNA molecules, the method comprising:

providing a population of RNA molecules, said population including RNA molecules 
having a 5' terminal Gppp cap structure and a 5' terminal phosphate group; 

contacting said population of RNA molecules with a phosphatase under conditions that
result in removal of the 5' terminal phosphate group while leaving the 5' terminal Gppp cap
structure intact; 

inactivating said phosphatase; 

contacting the population of RNA molecules with a pyrophosphatase under conditions 
that result in the removal of the 5' terminal Gppp and the formation of a 5' phosphate group;

annealing an oligonucleotide in the presence of an RNA ligase to form a hybrid molecule; and 

forming a cDNA from said oligonucleotide. 

22. A method of identifying an RNA sequence in a sample comprising a plurality of RNA 
sequences, the method comprising: 

synthesizing cDNA copies of a plurality of RNA species to form a cDNA sample; 

determining the size of one or more of said cDNA molecules in said cDNA sample; 

comparing the size of said sample with the size of a reference nucleic acid; and

thereby identifying the cDNA sequence.


23. The method of claim 22, wherein said cDNA molecules are digested with one or more
restriction enzymes prior to the determining step. 

24. The method of claim 23, further comprising ligating adapter oligonucleotides to the 
termini of the digested cDNA molecules prior to the determining step. 


25. The method of claim 22, wherein said identifying step comprises comparing the size of
one or more digestion products produced by one or more said cDNA molecules to a reference 
nucleic acid or nucleic acids. 


26. A method of identifying an RNA sequence in a population of RNA sequences, the 
method comprising: 

(a) removing 5' terminal pppG from RNAs in said population to form a population of 
RNAs having terminal 5' phosphate groups; 

(b) ligating a linker oligonucleotide to the terminal 5' phosphate groups of RNA 
molecules in said population of RNAs; 

(c) synthesizing complementary cDNA molecules from said population of RNA
molecules to form a cDNA sample;

(d) digesting said complementary cDNA molecules with at least one restriction enzyme; 

(e) ligating an adapter molecule to the digested cDNA molecules; 

(f) amplifying the molecules produced in step (e); 

(g) identifying the amplified molecules of step (f); and 

(h) comparing the amplified molecules to one or more reference nucleic acids. 
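
For illustration only, the size comparison recited in several of the claims above (comparing the size of a digestion product to the sizes of fragments generated by the same restriction enzyme in a reference nucleic acid) might be sketched as follows. This is a simplified sketch under stated assumptions and not the implementation contemplated by the claims: the function names, the `tolerance` parameter, and the reference dictionary are hypothetical; sequences are treated as linear; only one strand is scanned; and any length contributed by ligated adapters is ignored.

```python
def digest_fragment_sizes(sequence, site, cut_offset):
    """Sizes of the fragments obtained by cutting a linear sequence at every
    occurrence of a recognition site.

    cut_offset is the position of the cut within the site, for example 1 for
    an enzyme that cuts G^AATTC.
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # drop zero-length pieces that arise when a cut falls at a sequence end
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def matching_references(observed_size, references, site, cut_offset, tolerance=0):
    """Names of reference sequences whose digest yields a fragment within
    `tolerance` bases of the observed fragment size."""
    hits = []
    for name, seq in references.items():
        sizes = digest_fragment_sizes(seq, site, cut_offset)
        if any(abs(size - observed_size) <= tolerance for size in sizes):
            hits.append(name)
    return hits
```

An empty list returned by `matching_references` would correspond to a fragment whose size is not predicted by any reference sequence supplied, which in the sense of claim 1 marks the fragment as a candidate novel sequence for further characterization.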

<150> 60/115,109
<151> 1999-01-08 

<150> 09/417,386 
<151> 1999-10-13 

<160> 14 

<170> PatentIn Ver. 2.0

<210> 1
<211> 18
<212> DNA

<213> Artificial Sequence 
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 1 

ctctccgatg caggtggc 18 

<210> 2 
<211> 46 
<212> DNA 

<213> Artificial Sequence 

<220>

<223> Description of Artificial Sequence: PCR primer 

<400> 2 

agcacactcc agcctctctc cgagcacatg cgacactgag cactac 46 

<210> 3 
<211> 46 
<212> DNA 

<213> Artificial Sequence 

<210> 14
<211> 43
<212> DNA

<213> Artificial Sequence
<220> 

<223> Description of Artificial Sequence: PCR primer 
<400> 14 

agcacactcc agcctctctc cgattttttt tttttttttt ttt 43 